In this paper, we propose a novel attention module termed the Differentiable Channel Selection Attention module, or the DCS-Attention module. In contrast with conventional self-attention, the DCS-Attention module selects informative channels when computing the attention weights. The selection of the feature channels is performed in a differentiable manner, enabling seamless integration with DNN training. Our DCS-Attention is compatible with either fixed neural network backbones or learnable backbones with Differentiable Neural Architecture Search (DNAS), leading to DCS with Fixed Backbone (DCS-FB) and DCS-DNAS, respectively. Importantly, our DCS-Attention is motivated by the principle of Information Bottleneck (IB), and a novel variational upper bound for the IB loss, which can be optimized by SGD, is derived and incorporated into the training loss of networks with DCS-Attention modules. In this manner, a neural network with DCS-Attention modules can select the most informative channels for feature extraction, achieving state-of-the-art performance on the Re-ID task. Extensive experiments on multiple person Re-ID benchmarks using both DCS-FB and DCS-DNAS show that DCS-Attention significantly enhances the prediction accuracy of DNNs for person Re-ID, demonstrating its effectiveness in learning discriminative features critical to identifying person identities. The code of our work is available at this https URL.
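The abstract does not spell out how the channel selection is made differentiable. As a minimal PyTorch sketch of the general idea (not the paper's implementation), the module below learns a per-channel gate with a Gumbel-sigmoid relaxation and uses only the gated channels when forming the attention weights; all names, initializations, and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelectAttention(nn.Module):
    """Self-attention in which the query/key projections only see a
    differentiably selected subset of channels (illustrative sketch)."""

    def __init__(self, dim, tau=1.0):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.ones(dim))  # one logit per channel, start "on"
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.tau = tau

    def channel_gate(self):
        # Gumbel-sigmoid (binary concrete) relaxation: a near-0/1 mask that
        # stays differentiable with respect to the gate logits.
        if self.training:
            u = torch.rand_like(self.gate_logits).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log1p(-u)
            return torch.sigmoid((self.gate_logits + noise) / self.tau)
        return (self.gate_logits > 0).float()

    def forward(self, x):            # x: (B, N, dim) tokens
        g = self.channel_gate()      # (dim,)
        q = self.q(x * g)            # attention weights use selected channels only
        k = self.k(x * g)
        v = self.v(x)                # values keep all channels
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

x = torch.randn(2, 49, 256)          # e.g. a 7x7 feature map flattened to 49 tokens
print(ChannelSelectAttention(256)(x).shape)   # torch.Size([2, 49, 256])
```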
https://arxiv.org/abs/2505.08961
Gait recognition, known for its ability to identify individuals from a distance, has gained significant attention in recent years due to its non-intrusive nature. While video-based gait identification systems perform well on large public datasets, their performance drops when applied to real-world, unconstrained gait data due to various factors. Among these, uncontrolled outdoor environments, non-overlapping camera views, varying illumination, and computational efficiency are core challenges in gait-based authentication. Currently, no dataset addresses all these challenges simultaneously. In this paper, we propose an OptiGait-LGBM model capable of performing person re-identification under these constraints using a skeletal model approach, which helps mitigate inconsistencies in a person's appearance. The model constructs a dataset from landmark positions, minimizing memory usage by using non-sequential data. A benchmark dataset, RUET-GAIT, is introduced to represent uncontrolled gait sequences in complex outdoor environments. The process involves extracting skeletal joint landmarks, generating numerical datasets, and developing the OptiGait-LGBM gait classification model. Our aim is to address the aforementioned challenges with minimal computational cost compared to existing methods. A comparative analysis with ensemble techniques such as Random Forest and CatBoost demonstrates that the proposed approach outperforms them in terms of accuracy, memory usage, and training time. This method provides a novel, low-cost, and memory-efficient video-based gait recognition solution for real-world scenarios.
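To make the landmark-to-classifier stage concrete, here is a minimal sketch assuming each frame contributes one flattened landmark vector and LightGBM's scikit-learn interface is used; the synthetic data, feature layout, and hyperparameters are placeholders rather than the paper's choices.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for per-frame skeletal landmarks:
# 33 joints x (x, y) coordinates flattened into one 66-dim sample per frame.
rng = np.random.default_rng(0)
n_subjects, frames_per_subject = 10, 200
X = rng.normal(size=(n_subjects * frames_per_subject, 66))
y = np.repeat(np.arange(n_subjects), frames_per_subject)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

clf = lgb.LGBMClassifier(n_estimators=200, num_leaves=31, learning_rate=0.05)
clf.fit(X_tr, y_tr)
print("frame-level accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```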
https://arxiv.org/abs/2505.08801
Video surveillance image analysis and processing is a challenging field in computer vision, with one of its most difficult tasks being Person Re-Identification (PRe-ID). PRe-ID aims to identify and track target individuals who have already been detected in a network of cameras, using a robust description of their pedestrian images. The success of recent research in PRe-ID is largely due to effective feature extraction and representation, as well as the powerful learning of these features to reliably discriminate between pedestrian images. To this end, two powerful feature types, Convolutional Neural Network (CNN) features and Local Maximal Occurrence (LOMO) features, are modeled as multidimensional data using the proposed method, High-Dimensional Feature Fusion (HDFF). Specifically, a new tensor fusion scheme is introduced to leverage and combine these two types of features in a single tensor, even though their dimensions are not identical. To enhance the system's accuracy, we employ Tensor Cross-View Quadratic Analysis (TXQDA) for multilinear subspace learning, followed by cosine similarity for matching. TXQDA efficiently facilitates learning while reducing the high dimensionality inherent in high-order tensor data. The effectiveness of our approach is verified through experiments on three widely used PRe-ID datasets: VIPeR, GRID, and PRID450S. Extensive experiments demonstrate that our approach outperforms recent state-of-the-art methods.
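The abstract does not detail how two feature vectors of unequal length are placed into one tensor. One simple illustration, shown below with NumPy, is to zero-pad both vectors to a common length and stack them along a new mode; the actual HDFF fusion scheme may differ.

```python
import numpy as np

def fuse_to_tensor(cnn_feat, lomo_feat):
    """Stack two feature vectors of unequal length into one tensor slice by
    zero-padding to a common length (illustrative; not the exact HDFF scheme)."""
    d = max(cnn_feat.size, lomo_feat.size)
    pad = lambda v: np.pad(v, (0, d - v.size))
    return np.stack([pad(cnn_feat), pad(lomo_feat)], axis=0)   # shape (2, d)

cnn_feat = np.random.randn(2048)      # e.g. a CNN embedding
lomo_feat = np.random.randn(26960)    # LOMO descriptors are much longer
gallery = np.stack([fuse_to_tensor(cnn_feat, lomo_feat) for _ in range(5)])
print(gallery.shape)                  # (5, 2, 26960): sample x feature-type x dim
```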
https://arxiv.org/abs/2505.15825
Visible-infrared person re-identification (VIReID) provides a solution for ReID tasks in 24-hour scenarios; however, significant challenges persist in achieving satisfactory performance due to the substantial discrepancies between visible (VIS) and infrared (IR) modalities. Existing methods inadequately leverage information from the two modalities, primarily focusing on mining discriminative features from modality-shared information while neglecting modality-specific details. To fully utilize these differentiated details, we propose a Base-Detail Feature Learning Framework (BDLF) that enhances the learning of both base and detail knowledge, thereby capitalizing on both modality-shared and modality-specific information. Specifically, the proposed BDLF mines detail and base features through a lossless detail feature extraction module and a complementary base embedding generation mechanism, respectively, supported by a novel correlation restriction method that ensures the features gained by BDLF enrich both detail and base knowledge across VIS and IR features. Comprehensive experiments conducted on the SYSU-MM01, RegDB, and LLCM datasets validate the effectiveness of BDLF.
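The correlation restriction is not specified in the abstract. A common way to encourage two branches (base and detail) to carry complementary information is to penalize the cross-correlation between their features across a batch, as in the sketch below; this is an illustrative choice, not necessarily the paper's formulation.

```python
import torch

def cross_correlation_penalty(base, detail, eps=1e-6):
    """Penalize correlation between base and detail features across the batch,
    pushing the two branches toward complementary information (illustrative)."""
    b = (base - base.mean(0)) / (base.std(0) + eps)
    d = (detail - detail.mean(0)) / (detail.std(0) + eps)
    corr = (b.T @ d) / base.shape[0]      # (dim_base, dim_detail) correlation matrix
    return (corr ** 2).mean()

base = torch.randn(32, 512, requires_grad=True)
detail = torch.randn(32, 512, requires_grad=True)
loss = cross_correlation_penalty(base, detail)
loss.backward()
print(float(loss))
```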
https://arxiv.org/abs/2505.03286
Unsupervised visible-infrared person re-identification (UVI-ReID) aims to retrieve pedestrian images across different modalities without costly annotations, but faces challenges due to the modality gap and lack of supervision. Existing methods often adopt self-training with clustering-generated pseudo-labels but implicitly assume these labels are always correct. In practice, however, this assumption fails due to inevitable pseudo-label noise, which hinders model learning. To address this, we introduce a new learning paradigm that explicitly considers Pseudo-Label Noise (PLN), characterized by three key challenges: noise overfitting, error accumulation, and noisy cluster correspondence. To this end, we propose a novel Robust Duality Learning framework (RoDE) for UVI-ReID to mitigate the effects of noisy pseudo-labels. First, to combat noise overfitting, a Robust Adaptive Learning mechanism (RAL) is proposed to dynamically emphasize clean samples while down-weighting noisy ones. Second, to alleviate error accumulation-where the model reinforces its own mistakes-RoDE employs dual distinct models that are alternately trained using pseudo-labels from each other, encouraging diversity and preventing collapse. However, this dual-model strategy introduces misalignment between clusters across models and modalities, creating noisy cluster correspondence. To resolve this, we introduce Cluster Consistency Matching (CCM), which aligns clusters across models and modalities by measuring cross-cluster similarity. Extensive experiments on three benchmarks demonstrate the effectiveness of RoDE.
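As a rough illustration of the cluster-alignment idea behind CCM, the sketch below matches two sets of cluster centroids (e.g., from the two models or the two modalities) by maximizing total cosine similarity with the Hungarian algorithm; the concrete procedure in the paper may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters(centroids_a, centroids_b):
    """Align two sets of cluster centroids by maximizing total cosine
    similarity with the Hungarian algorithm (illustrative sketch)."""
    a = centroids_a / np.linalg.norm(centroids_a, axis=1, keepdims=True)
    b = centroids_b / np.linalg.norm(centroids_b, axis=1, keepdims=True)
    sim = a @ b.T                                   # (Ka, Kb) cross-cluster similarity
    row, col = linear_sum_assignment(-sim)          # negate to maximize similarity
    return dict(zip(row.tolist(), col.tolist())), sim[row, col].mean()

ca = np.random.randn(50, 256)   # centroids from model/modality A
cb = np.random.randn(50, 256)   # centroids from model/modality B
mapping, mean_sim = match_clusters(ca, cb)
print(len(mapping), round(float(mean_sim), 3))
```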
https://arxiv.org/abs/2505.02549
Practical applications of computer vision in smart cities usually assume system integration and operation in challenging open-world environments. In the person re-identification task, the main goal is to determine whether a specific person has appeared at another place and time within the same video, or across multiple camera feeds. This typically involves collecting raw data from video surveillance cameras at different locations and under varying illumination conditions. In the considered open-world setting, it also requires detecting and localizing the person inside the analyzed video frame before the main re-identification step. With multi-person and multi-camera setups, system complexity increases, requiring sophisticated tracking solutions and re-identification models. In this work we discuss existing challenges in system design architectures, consider possible solutions based on different computer vision techniques, and describe applications of such systems in retail stores and public spaces for improved marketing analytics. To analyze the sensitivity of the person re-identification task under different open-world environments, the performance of a near-real-time solution is demonstrated on several video captures and live camera feeds. Finally, based on the conducted experiments, we indicate further research directions and possible system improvements.
https://arxiv.org/abs/2505.00772
Visible-Infrared Person Re-Identification (VI-ReID) is a challenging task due to the large modality discrepancy between visible and infrared images, which complicates the alignment of their features into a suitable common space. Moreover, style noise, such as illumination and color contrast, reduces the identity discriminability and modality invariance of features. To address these challenges, we propose a novel Diverse Semantics-guided Feature Alignment and Decoupling (DSFAD) network to align identity-relevant features from different modalities into a textual embedding space and disentangle identity-irrelevant features within each modality. Specifically, we develop a Diverse Semantics-guided Feature Alignment (DSFA) module, which generates pedestrian descriptions with diverse sentence structures to guide the cross-modality alignment of visual features. Furthermore, to filter out style information, we propose a Semantic Margin-guided Feature Decoupling (SMFD) module, which decomposes visual features into pedestrian-related and style-related components, and then constrains the similarity between the former and the textual embeddings to be at least a margin higher than that between the latter and the textual embeddings. Additionally, to prevent the loss of pedestrian semantics during feature decoupling, we design a Semantic Consistency-guided Feature Restitution (SCFR) module, which further excavates useful information for identification from the style-related features and restores it back into the pedestrian-related features, and then constrains the similarity between the features after restitution and the textual embeddings to be consistent with that between the features before decoupling and the textual embeddings. Extensive experiments on three VI-ReID datasets demonstrate the superiority of our DSFAD.
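Read literally, the SMFD constraint above is a margin between two text similarities, which can be written as a hinge loss; the sketch below is one such reading in PyTorch, with the margin value chosen arbitrarily and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def semantic_margin_loss(ped_feat, style_feat, text_emb, margin=0.2):
    """Hinge loss asking the pedestrian-related component to be closer to the
    textual embedding than the style-related component by at least `margin`
    (a direct reading of the SMFD constraint; the exact form may differ)."""
    sim_ped = F.cosine_similarity(ped_feat, text_emb, dim=-1)
    sim_style = F.cosine_similarity(style_feat, text_emb, dim=-1)
    return F.relu(margin - (sim_ped - sim_style)).mean()

ped = torch.randn(8, 512, requires_grad=True)
style = torch.randn(8, 512, requires_grad=True)
text = torch.randn(8, 512)
loss = semantic_margin_loss(ped, style, text)
loss.backward()
print(float(loss))
```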
https://arxiv.org/abs/2505.00619
Unsupervised visible-infrared person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning. Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design a contrastive learning framework for global feature learning. However, these methods overlook the cross-modality variations in feature representation and pseudo-label distributions brought by fine-grained patterns. This oversight results in insufficient modality-shared learning when only global features are optimized. To address this issue, we propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds up the optimization objective for specific fine-grained patterns emphasized by each modality, thereby achieving complementary alignment between the label distributions of different modalities. Specifically, we first introduce a Dual Association with Global Learning (DAGI) module to unify the pseudo-labels of cross-modality instances in a bi-directional manner. Afterward, a Fine-Grained Semantic-Aligned Learning (FGSAL) module is applied to explore part-level semantic-aligned patterns emphasized by each modality from cross-modality instances. The optimization objective is then formulated based on the semantic-aligned features and their corresponding label space. To alleviate the side effects arising from noisy pseudo-labels, we propose a Global-Part Collaborative Refinement (GPCR) module to dynamically mine reliable positive sample sets for the global and part features and to optimize inter-instance relationships. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves superior performance to state-of-the-art methods. Our code is available at this https URL.
https://arxiv.org/abs/2504.19244
Visible-Infrared Person Re-identification (VIReID) aims to match visible and infrared pedestrian images, but the modality differences and the complexity of identity features make it challenging. Existing methods rely solely on identity label supervision, which makes it difficult to fully extract high-level semantic information. Recently, vision-language pre-trained models have been introduced to VIReID, enhancing semantic information modeling by generating textual descriptions. However, such methods do not explicitly model body shape features, which are crucial for cross-modal matching. To address this, we propose an effective Body Shape-aware Textual Alignment (BSaTa) framework that explicitly models and utilizes body shape information to improve VIReID performance. Specifically, we design a Body Shape Textual Alignment (BSTA) module that extracts body shape information using a human parsing model and converts it into structured text representations via CLIP. We also design a Text-Visual Consistency Regularizer (TVCR) to ensure alignment between body shape textual representations and visual body shape features. Furthermore, we introduce a Shape-aware Representation Learning (SRL) mechanism that combines Multi-text Supervision and Distribution Consistency Constraints to guide the visual encoder to learn modality-invariant and discriminative identity features, thus enhancing modality invariance. Experimental results demonstrate that our method achieves superior performance on the SYSU-MM01 and RegDB datasets, validating its effectiveness.
https://arxiv.org/abs/2504.18025
Lifelong Person Re-identification (LReID) suffers from a key challenge in preserving old knowledge while adapting to new information. The existing solutions include rehearsal-based and rehearsal-free methods to address this challenge. Rehearsal-based approaches rely on knowledge distillation, which continuously accumulates forgetting during the distillation process. Rehearsal-free methods insufficiently learn the distribution of each domain, leading to forgetting over time. To solve these issues, we propose a novel Distribution-aware Forgetting Compensation (DAFC) model that explores cross-domain shared representation learning and domain-specific distribution integration without using old exemplars or knowledge distillation. We propose a Text-driven Prompt Aggregation (TPA) that utilizes text features to enrich prompt elements and guide the prompt model to learn fine-grained representations for each instance. This enhances the differentiation of identity information and establishes the foundation for domain distribution awareness. Then, Distribution-based Awareness and Integration (DAI) is designed to capture each domain-specific distribution by a dedicated expert network and adaptively consolidate them into a shared region in high-dimensional space. In this manner, DAI can consolidate and enhance cross-domain shared representation learning while alleviating catastrophic forgetting. Furthermore, we develop a Knowledge Consolidation Mechanism (KCM) that comprises instance-level discrimination and cross-domain consistency alignment strategies to facilitate adaptive learning of new knowledge from the current domain and to promote knowledge consolidation across the acquired domain-specific distributions, respectively. Experimental results show that our DAFC outperforms state-of-the-art methods by at least 9.8%/6.6% and 6.4%/6.2% in average mAP/R@1 on two training orders.
https://arxiv.org/abs/2504.15041
This paper introduces a novel dual-region augmentation approach designed to reduce reliance on large-scale labeled datasets while improving model robustness and adaptability across diverse computer vision tasks, including source-free domain adaptation (SFDA) and person re-identification (ReID). Our method performs targeted data transformations by applying random noise perturbations to foreground objects and spatially shuffling background patches. This effectively increases the diversity of the training data, improving model robustness and generalization. Evaluations on the PACS dataset for SFDA demonstrate that our augmentation strategy consistently outperforms existing methods, achieving significant accuracy improvements in both single-target and multi-target adaptation settings. By augmenting training data through structured transformations, our method enables model generalization across domains, providing a scalable solution for reducing reliance on manually annotated datasets. Furthermore, experiments on Market-1501 and DukeMTMC-reID datasets validate the effectiveness of our approach for person ReID, surpassing traditional augmentation techniques.
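A minimal sketch of the described transformation, assuming a binary foreground mask is available (e.g., from a segmentation model): Gaussian noise is added to the person region and the rest of the image is replaced by randomly permuted patches. Patch size and noise level are placeholders, not the paper's settings.

```python
import torch

def dual_region_augment(img, fg_mask, noise_std=0.1, patch=16):
    """Add Gaussian noise to foreground pixels and fill the background with
    randomly shuffled patches. img: (C, H, W) float tensor; fg_mask: (H, W)
    binary tensor (1 = person). H and W must be divisible by `patch`."""
    c, h, w = img.shape
    noisy = img + noise_std * torch.randn_like(img) * fg_mask            # foreground noise
    # split the image into a grid of patches and permute them randomly
    patches = noisy.unfold(1, patch, patch).unfold(2, patch, patch)       # (C, gh, gw, p, p)
    gh, gw = patches.shape[1], patches.shape[2]
    shuffled = patches.reshape(c, gh * gw, patch, patch)[:, torch.randperm(gh * gw)]
    bg = shuffled.reshape(c, gh, gw, patch, patch).permute(0, 1, 3, 2, 4).reshape(c, h, w)
    return torch.where(fg_mask.bool(), noisy, bg)   # noisy person, shuffled background

img = torch.rand(3, 256, 128)
mask = torch.zeros(256, 128)
mask[64:192, 32:96] = 1
print(dual_region_augment(img, mask).shape)          # torch.Size([3, 256, 128])
```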
https://arxiv.org/abs/2504.13077
Person re-identification (Re-ID) aims to match the same pedestrian across a large gallery captured by different cameras and views. Enhancing the robustness of the extracted feature representations is a main challenge in Re-ID. Existing methods usually improve feature representation by improving the model architecture, but most ignore potential contextual information, which limits the effectiveness of feature representation and retrieval performance. Neighborhood information, especially the potential information of multi-order neighborhoods, can effectively enrich feature expression and improve retrieval accuracy, but this has not been fully explored in existing research. Therefore, we propose a novel model, DMON-ARO, that leverages latent neighborhood information to enhance both feature representation and retrieval performance. Our approach is built on two complementary modules: Dynamic Multi-Order Neighbor Modeling (DMON) and Asymmetric Relationship Optimization (ARO). The DMON module dynamically aggregates multi-order neighbor relationships, allowing it to capture richer contextual information and enhance feature representation through adaptive neighborhood modeling. Meanwhile, ARO refines the distance matrix by optimizing query-to-gallery relationships, improving retrieval accuracy. Extensive experiments on three benchmark datasets demonstrate that our approach achieves performance improvements over baseline models, which illustrates the effectiveness of our model. Specifically, our model demonstrates improvements in Rank-1 accuracy and mAP. Moreover, this method can also be directly extended to other re-identification tasks.
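To illustrate what multi-order neighbor aggregation means in practice, the sketch below enriches each gallery feature with the mean of its first- and second-order neighbors using fixed weights; DMON itself weights neighbors adaptively, so this is only a simplified stand-in.

```python
import numpy as np

def multi_order_neighbor_features(feats, k=5, alpha=(0.6, 0.3, 0.1)):
    """Enrich each feature with its 1st- and 2nd-order neighbors in the gallery
    (fixed-weight illustration of multi-order neighbor aggregation)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    nn1 = np.argsort(-sim, axis=1)[:, 1:k + 1]                  # 1st-order neighbors
    first = f[nn1].mean(axis=1)                                  # mean of neighbors
    nn2 = nn1[nn1].reshape(len(f), -1)                           # neighbors of neighbors
    second = f[nn2].mean(axis=1)                                 # 2nd-order aggregation
    out = alpha[0] * f + alpha[1] * first + alpha[2] * second
    return out / np.linalg.norm(out, axis=1, keepdims=True)

gallery = np.random.randn(100, 256)
print(multi_order_neighbor_features(gallery).shape)    # (100, 256)
```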
https://arxiv.org/abs/2504.11798
Transit Origin-Destination (OD) data are essential for transit planning, particularly in route optimization and demand-responsive paratransit systems. Traditional methods, such as manual surveys, are costly and inefficient, while Bluetooth and WiFi-based approaches require passengers to carry specific devices, limiting data coverage. On the other hand, most transit vehicles are equipped with onboard cameras for surveillance, offering an opportunity to repurpose them for edge-based OD data collection through visual person re-identification (ReID). However, such approaches face significant challenges, including severe occlusion and viewpoint variations in transit environments, which greatly reduce matching accuracy and hinder their adoption. Moreover, designing effective algorithms that can operate efficiently on edge devices remains an open challenge. To address these challenges, we propose TransitReID, a novel framework for individual-level transit OD data collection. TransitReID consists of two key components: (1) an occlusion-robust ReID algorithm featuring a variational autoencoder-guided region-attention mechanism that adaptively focuses on visible body regions through reconstruction-loss-optimized weight allocation; and (2) a Hierarchical Storage and Dynamic Matching (HSDM) mechanism specifically designed for efficient and robust transit OD matching, which balances storage, speed, and accuracy. Additionally, a multi-threaded design supports near-real-time operation on edge devices while also ensuring privacy protection. We also introduce a ReID dataset tailored for complex bus environments to address the lack of relevant training data. Experimental results demonstrate that TransitReID achieves state-of-the-art performance in ReID tasks, with an accuracy of approximately 90% in bus route simulations.
https://arxiv.org/abs/2504.11500
Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both Inter-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines.
https://arxiv.org/abs/2504.10174
Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative ReID models to maintain identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust network is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's capability to represent persons. To address these issues, we propose a novel two-stage feature learning framework named SD-ReID for AG-ReID, which takes advantage of the powerful understanding capacity of generative models, e.g., Stable Diffusion (SD), to generate view-specific features between different viewpoints. In the first stage, we train a simple ViT-based model to extract coarse-grained representations and controllable conditions. Then, in the second stage, we fine-tune the SD model to learn complementary representations guided by the controllable conditions. Furthermore, we propose the View-Refine Decoder (VRD) to obtain additional controllable conditions to generate missing cross-view features. Finally, we use the coarse-grained representations and all-view features generated by SD to retrieve target persons. Extensive experiments on the AG-ReID benchmarks demonstrate the effectiveness of our proposed SD-ReID. The source code will be available upon acceptance.
https://arxiv.org/abs/2504.09549
Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras in different views. Previous methods usually adopt large-scale models, focusing on view-invariant features. However, they overlook the semantic information in person attributes. Additionally, existing training strategies often rely on full fine-tuning large-scale models, which significantly increases training costs. To address these issues, we propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge. More specifically, we first introduce the Contrastive Language-Image Pre-training (CLIP) model as the backbone, and propose an Attribute-aware Image Encoder (AIE) to extract global semantic features and attribute-aware features. Then, with these features, we propose a Prompted Attribute Classifier Group (PACG) to generate person attribute predictions and obtain the encoded representations of predicted attributes. Finally, we design a Coupled Prompt Template (CPT) to transform attribute tokens and view information into structured sentences. These sentences are processed by the text encoder of CLIP to generate more discriminative features. As a result, our framework can fully leverage attribute-based text knowledge to improve the AG-ReID. Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of our proposed LATex. The source code will be available.
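The exact Coupled Prompt Template is not given in the abstract. The sketch below shows one plausible way to turn predicted attributes and view information into a structured sentence and encode it with the OpenAI CLIP text encoder; the template wording and model variant are assumptions, not the paper's.

```python
import torch
import clip   # OpenAI CLIP, https://github.com/openai/CLIP

def build_prompt(attributes, view):
    """Turn predicted attributes and view information into a structured sentence
    (placeholder wording; the paper's Coupled Prompt Template may differ)."""
    return f"A {view} view photo of a person who is {', '.join(attributes)}."

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

prompts = [
    build_prompt(["wearing a red jacket", "carrying a backpack"], "aerial"),
    build_prompt(["wearing a red jacket", "carrying a backpack"], "ground"),
]
with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(prompts).to(device))
print(text_feats.shape)   # torch.Size([2, 512])
```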
https://arxiv.org/abs/2503.23722
Clothes-changing person re-identification (CC-ReID) aims to recognize individuals under different clothing scenarios. Current CC-ReID approaches either concentrate on modeling body shape using additional modalities including silhouette, pose, and body mesh, potentially causing the model to overlook other critical biometric traits such as gender, age, and style, or they incorporate supervision through additional labels that the model tries to disregard or emphasize, such as clothing or personal attributes. However, these annotations are discrete in nature and do not capture comprehensive descriptions. In this work, we propose DIFFER: Disentangle Identity Features From Entangled Representations, a novel adversarial learning method that leverages textual descriptions to disentangle identity features. Recognizing that image features inherently mix inseparable information, DIFFER introduces NBDetach, a mechanism designed for feature disentanglement by leveraging the separable nature of text descriptions as supervision. It partitions the feature space into distinct subspaces and, through gradient reversal layers, effectively separates identity-related features from non-biometric features. We evaluate DIFFER on four benchmark datasets (LTCC, PRCC, CelebReID-Light, and CCVID) to demonstrate its effectiveness and provide state-of-the-art performance across all the benchmarks. DIFFER consistently outperforms the baseline method, with improvements in top-1 accuracy of 3.6% on LTCC, 3.4% on PRCC, 2.5% on CelebReID-Light, and 1% on CCVID. Our code can be found here.
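The gradient reversal layer mentioned above is a standard construct (identity in the forward pass, negated gradient in the backward pass); a self-contained PyTorch version is sketched below. How DIFFER attaches it to the non-biometric subspaces is specific to the paper and not reproduced here.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the
    backward pass, so a branch trained to predict non-biometric factors pushes
    the shared encoder away from encoding them."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

feat = torch.randn(4, 256, requires_grad=True)
grad_reverse(feat, 0.5).sum().backward()
print(feat.grad[0, :3])     # gradients are negated and scaled by 0.5
```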
https://arxiv.org/abs/2503.22912
Person re-identification (ReID) plays a critical role in applications like security surveillance and criminal investigations by matching individuals across large image galleries captured by non-overlapping cameras. Traditional ReID methods rely on unimodal inputs, typically images, but face limitations due to challenges like occlusions, lighting changes, and pose variations. While advancements in image-based and text-based ReID systems have been made, the integration of both modalities has remained under-explored. This paper presents FusionSegReID, a multimodal model that combines both image and text inputs for enhanced ReID performance. By leveraging the complementary strengths of these modalities, our model improves matching accuracy and robustness, particularly in complex, real-world scenarios where one modality may struggle. Our experiments show significant improvements in Top-1 accuracy and mean Average Precision (mAP) for ReID, as well as better segmentation results in challenging scenarios like occlusion and low-quality images. Ablation studies further confirm that multimodal fusion and segmentation modules contribute to enhanced re-identification and mask accuracy. The results show that FusionSegReID outperforms traditional unimodal models, offering a more robust and flexible solution for real-world person ReID tasks.
https://arxiv.org/abs/2503.21595
During criminal investigations, images of persons of interest directly influence the success of identification procedures. However, law enforcement agencies often face challenges related to the scarcity of high-quality images or their obsolescence, which can affect the accuracy and success of person search processes. This paper introduces a novel forensic mugshot augmentation framework aimed at addressing these limitations. Our approach enhances the identification probability of individuals by generating additional, high-quality images through customizable data augmentation techniques, while maintaining the biometric integrity and consistency of the original data. Several experimental results show that our method significantly improves identification accuracy and robustness across various forensic scenarios, demonstrating its effectiveness as a trustworthy tool for law enforcement applications. Index Terms: Digital Forensics, Person re-identification, Feature extraction, Data augmentation, Visual-Language models.
https://arxiv.org/abs/2503.19478
Real-world surveillance systems are dynamically evolving, requiring a person re-identification model to continuously handle newly incoming data from various domains. To cope with these dynamics, Lifelong ReID (LReID) has been proposed to learn and accumulate knowledge across multiple domains incrementally. However, LReID models need to be trained on large-scale labeled data for each unseen domain, which are typically inaccessible due to privacy and cost concerns. In this paper, we propose a new paradigm called Continual Few-shot ReID (CFReID), which requires models to be incrementally trained using few-shot data and tested on all seen domains. Under few-shot conditions, CFReID faces two core challenges: 1) learning knowledge from the few-shot data of an unseen domain, and 2) avoiding catastrophic forgetting of seen domains. To tackle these two challenges, we propose a Stable Distribution Alignment (SDA) framework from a feature distribution perspective. Specifically, our SDA is composed of two modules, i.e., Meta Distribution Alignment (MDA) and Prototype-based Few-shot Adaptation (PFA). To support the study of CFReID, we establish an evaluation benchmark for CFReID on five publicly available ReID datasets. Extensive experiments demonstrate that our SDA can enhance the few-shot learning and anti-forgetting capabilities under few-shot conditions. Notably, our approach, using only 5% of the data, i.e., 32 IDs, significantly outperforms LReID's state-of-the-art performance, which requires 700 to 1,000 IDs.
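As a generic illustration of prototype-based few-shot adaptation, the sketch below builds one prototype per identity from a handful of embeddings and classifies queries by nearest prototype under cosine similarity; the actual PFA and MDA modules operate on feature distributions and are more involved than this stand-in.

```python
import torch
import torch.nn.functional as F

def build_prototypes(feats, labels):
    """Average the few-shot embeddings of each identity into one prototype."""
    ids = labels.unique()
    protos = torch.stack([feats[labels == i].mean(0) for i in ids])
    return F.normalize(protos, dim=1), ids

def nearest_prototype(query, protos, ids):
    sim = F.normalize(query, dim=1) @ protos.T        # cosine similarity to prototypes
    return ids[sim.argmax(dim=1)]

# 32 identities x 2 images each, 256-d embeddings from a (frozen or adapted) encoder
feats = torch.randn(64, 256)
labels = torch.arange(32).repeat_interleave(2)
protos, ids = build_prototypes(feats, labels)
queries = torch.randn(5, 256)
print(nearest_prototype(queries, protos, ids))
```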
https://arxiv.org/abs/2503.18469