Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical applications of VLMs, e.g., scenarios where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, which not only incur substantial data collection and training costs but also struggle with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders to provide specialized features for learning diverse instances, which enables VLMs to learn new instances in context in a one-shot manner. Further, IIR-VLM leverages this knowledge for instance-aware visual understanding. We validate IIR-VLM's efficacy on existing instance personalization benchmarks. Finally, we demonstrate its superior ILR performance on a challenging new benchmark, which assesses ILR capabilities across varying difficulty levels and diverse categories, with persons, faces, pets, and general objects as the target instances.
https://arxiv.org/abs/2601.14188
We introduce a method for decentralized person re-identification in robot swarms that leverages natural language as the primary representational modality. Unlike traditional approaches that rely on opaque visual embeddings -- high-dimensional feature vectors extracted from images -- the proposed method uses human-readable language to represent observations. Each robot locally detects and describes individuals using a vision-language model (VLM), producing textual descriptions of appearance instead of feature vectors. These descriptions are compared and clustered across the swarm without centralized coordination, allowing robots to collaboratively group observations of the same individual. Each cluster is distilled into a representative description by a language model, providing an interpretable, concise summary of the swarm's collective perception. This approach enables natural-language querying, enhances transparency, and supports explainable swarm behavior. Preliminary experiments demonstrate competitive performance in identity consistency and interpretability compared to embedding-based methods, despite current limitations in text similarity and computational load. Ongoing work explores refined similarity metrics, semantic navigation, and the extension of language-based perception to environmental elements. This work prioritizes decentralized perception and communication, while active navigation remains an open direction for future study.
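The compare-and-cluster step over textual descriptions can be sketched with a toy token-overlap similarity; the paper's actual text-similarity metric, VLM prompting, and decentralized protocol are not specified here, so the threshold, helper names, and example strings below are all illustrative assumptions.

```python
# Hedged sketch: grouping textual appearance descriptions with a simple
# token-set (Jaccard) similarity and greedy clustering. Everything here
# (threshold, metric, examples) is illustrative, not the paper's method.

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two descriptions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def cluster_descriptions(descriptions, threshold=0.3):
    """Greedy clustering: join the first cluster whose representative
    (its first description) is similar enough, else start a new one."""
    clusters = []
    for d in descriptions:
        for cluster in clusters:
            if jaccard(d, cluster[0]) >= threshold:
                cluster.append(d)
                break
        else:
            clusters.append([d])
    return clusters

observations = [
    "man in red jacket with black backpack",
    "person wearing red jacket carrying black backpack",
    "woman in blue coat holding an umbrella",
]
groups = cluster_descriptions(observations)
# the two red-jacket sightings share a cluster; the blue coat gets its own
```

A real swarm would exchange these descriptions between robots and distill each cluster into one representative summary with a language model; this sketch only shows the similarity-and-grouping core.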
https://arxiv.org/abs/2601.12479
The core of video-based visible-infrared person re-identification (VVI-ReID) lies in learning sequence-level modal-invariant representations across different modalities. Recent research tends to use modality-shared language prompts generated by CLIP to guide the learning of modal-invariant representations. Despite achieving strong performance, such methods still face limitations in efficient spatial-temporal modeling, sufficient cross-modal interaction, and explicit modality-level loss guidance. To address these issues, we propose the language-driven sequence-level modal-invariant representation learning (LSMRL) method, which includes a spatial-temporal feature learning (STFL) module, a semantic diffusion (SD) module, and a cross-modal interaction (CMI) module. To enable parameter- and computation-efficient spatial-temporal modeling, the STFL module is built upon CLIP with minimal modifications. To achieve sufficient cross-modal interaction and enhance the learning of modal-invariant features, the SD module diffuses modality-shared language prompts into visible and infrared features to establish preliminary modal consistency. The CMI module then leverages bidirectional cross-modal self-attention to eliminate residual modality gaps and refine modal-invariant representations. To explicitly enhance the learning of modal-invariant representations, two modality-level losses are introduced to improve the features' discriminative ability and their generalization to unseen categories. Extensive experiments on large-scale VVI-ReID datasets demonstrate the superiority of LSMRL over state-of-the-art methods.
https://arxiv.org/abs/2601.12062
We propose unsupervised multi-scenario (UMS) person re-identification (ReID) as a new task that expands ReID across diverse scenarios (cross-resolution, clothing change, etc.) within a single coherent framework. To tackle UMS-ReID, we introduce image-text knowledge modeling (ITKM) -- a three-stage framework that effectively exploits the representational power of vision-language models. We start with a pre-trained CLIP model with an image encoder and a text encoder. In Stage I, we introduce a scenario embedding in the image encoder and fine-tune the encoder to adaptively leverage knowledge from multiple scenarios. In Stage II, we optimize a set of learned text embeddings to associate with pseudo-labels from Stage I and introduce a multi-scenario separation loss to increase the divergence between inter-scenario text representations. In Stage III, we first introduce cluster-level and instance-level heterogeneous matching modules to obtain reliable heterogeneous positive pairs (e.g., a visible image and an infrared image of the same person) within each scenario. Next, we propose a dynamic text representation update strategy to maintain consistency between text and image supervision signals. Experimental results across multiple scenarios demonstrate the superiority and generalizability of ITKM; it not only outperforms existing scenario-specific methods but also enhances overall performance by integrating knowledge from multiple scenarios.
https://arxiv.org/abs/2601.11243
Text-to-image person re-identification (TIReID) aims to retrieve person images from a large gallery given free-form textual descriptions. TIReID is challenging due to the substantial modality gap between visual appearances and textual expressions, as well as the need to model fine-grained correspondences that distinguish individuals with similar attributes such as clothing color, texture, or outfit style. To address these issues, we propose DiCo (Disentangled Concept Representation), a novel framework that achieves hierarchical and disentangled cross-modal alignment. DiCo introduces a shared slot-based representation, where each slot acts as a part-level anchor across modalities and is further decomposed into multiple concept blocks. This design enables the disentanglement of complementary attributes (e.g., color, texture, shape) while maintaining consistent part-level correspondence between image and text. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that our framework achieves competitive performance with state-of-the-art methods, while also enhancing interpretability through explicit slot- and block-level representations for more fine-grained retrieval results.
https://arxiv.org/abs/2601.10053
Tracklet quality is often treated as an afterthought in most person re-identification (ReID) methods, with the majority of research presenting architectural modifications to foundational models. Such approaches neglect an important limitation that poses challenges when deploying ReID systems in difficult real-world scenarios. In this paper, we introduce S3-CLIP, a video super-resolution-based CLIP-ReID framework developed for the VReID-XFD challenge at WACV 2026. The proposed method integrates recent advances in super-resolution networks with task-driven super-resolution pipelines, adapting them to the video-based person re-identification setting. To the best of our knowledge, this work represents the first systematic investigation of video super-resolution as a means of enhancing tracklet quality for person ReID, particularly under challenging cross-view conditions. Experimental results demonstrate performance competitive with the baseline, achieving 37.52% mAP in aerial-to-ground and 29.16% mAP in ground-to-aerial scenarios. In the ground-to-aerial setting, S3-CLIP achieves substantial gains in ranking accuracy, improving Rank-1, Rank-5, and Rank-10 performance by 11.24%, 13.48%, and 17.98%, respectively.
https://arxiv.org/abs/2601.08807
Video-based Person Re-IDentification (VPReID) aims to retrieve the same person from videos captured by non-overlapping cameras. At extreme far distances, VPReID is highly challenging due to severe resolution degradation, drastic viewpoint variation, and inevitable appearance noise. To address these issues, we propose a Scale-Adaptive framework with Shape Priors for VPReID, named SAS-VPReID. The framework is built upon three complementary modules. First, we deploy a Memory-Enhanced Visual Backbone (MEVB) to extract discriminative feature representations, which leverages the CLIP vision encoder and multi-proxy memory. Second, we propose a Multi-Granularity Temporal Modeling (MGTM) module to construct sequences at multiple temporal granularities and adaptively emphasize motion cues across scales. Third, we incorporate Prior-Regularized Shape Dynamics (PRSD) to capture body structure dynamics. With these modules, our framework obtains more discriminative feature representations. Experiments on the VReID-XFD benchmark demonstrate the effectiveness of each module, and our final framework ranks first on the VReID-XFD challenge leaderboard. The source code is available at this https URL.
https://arxiv.org/abs/2601.05535
Face Attribute Recognition (FAR) plays a crucial role in applications such as person re-identification, face retrieval, and face editing. Conventional multi-task attribute recognition methods often process the entire feature map for feature extraction and attribute classification, which can produce redundant features due to reliance on global regions. To address these challenges, we propose a novel approach emphasizing the selection of specific feature regions for efficient feature learning. We introduce the Mask-Guided Multi-Task Network (MGMTN), which integrates Adaptive Mask Learning (AML) and Group-Global Feature Fusion (G2FF) to address the aforementioned limitations. Leveraging a pre-trained keypoint annotation model and a fully convolutional network, AML accurately localizes critical facial parts (e.g., eye and mouth groups) and generates group masks that delineate meaningful feature regions, thereby mitigating negative transfer from global region usage. Furthermore, G2FF combines group and global features to enhance FAR learning, enabling more precise attribute identification. Extensive experiments on two challenging facial attribute recognition datasets demonstrate the effectiveness of MGMTN in improving FAR performance.
https://arxiv.org/abs/2601.01408
Person re-identification (ReID) plays a critical role in intelligent surveillance systems by linking identities across multiple cameras in complex environments. However, ReID faces significant challenges such as appearance variations, domain shifts, and limited labeled data. This dissertation proposes three advanced approaches to enhance ReID performance under supervised, unsupervised domain adaptation (UDA), and fully unsupervised settings. First, SCM-ReID integrates supervised contrastive learning with hybrid loss optimization (classification, center, triplet, and centroid-triplet losses), improving discriminative feature representation and achieving state-of-the-art accuracy on Market-1501 and CUHK03 datasets. Second, for UDA, IQAGA and DAPRH combine GAN-based image augmentation, domain-invariant mapping, and pseudo-label refinement to mitigate domain discrepancies and enhance cross-domain generalization. Experiments demonstrate substantial gains over baseline methods, with mAP and Rank-1 improvements up to 12% in challenging transfer scenarios. Finally, ViTC-UReID leverages Vision Transformer-based feature encoding and camera-aware proxy learning to boost unsupervised ReID. By integrating global and local attention with camera identity constraints, this method significantly outperforms existing unsupervised approaches on large-scale benchmarks. Comprehensive evaluations across CUHK03, Market-1501, DukeMTMC-reID, and MSMT17 confirm the effectiveness of the proposed methods. The contributions advance ReID research by addressing key limitations in feature learning, domain adaptation, and label noise handling, paving the way for robust deployment in real-world surveillance systems.
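One member of the loss family the dissertation combines, the (batch-hard) triplet loss, can be sketched directly; the exact hybrid weighting with the classification, center, and centroid-triplet terms is not reproduced, and the margin and toy data below are illustrative.

```python
import numpy as np

# Hedged sketch: a batch-hard triplet loss, a standard ReID metric-learning
# loss. Margin value and example features are illustrative, not the
# dissertation's actual configuration.

def batch_hard_triplet_loss(features, labels, margin=0.3):
    """For each anchor, penalize when the hardest (farthest) positive is
    not at least `margin` closer than the hardest (closest) negative."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)  # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(features)):
        hardest_pos = dist[i][same[i]].max()     # farthest same-ID sample
        hardest_neg = dist[i][~same[i]].min()    # closest different-ID sample
        losses.append(max(0.0, margin + hardest_pos - hardest_neg))
    return float(np.mean(losses))

feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
ids = np.array([0, 0, 1, 1])
loss = batch_hard_triplet_loss(feats, ids)  # well-separated IDs -> 0.0
```

In practice such a loss is computed on encoder embeddings within each mini-batch and summed with the other loss terms; here it only illustrates the margin-based ranking idea.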
https://arxiv.org/abs/2601.01356
Person re-identification (ReID) across aerial and ground views at extreme far distances introduces a distinct operating regime where severe resolution degradation, extreme viewpoint changes, unstable motion cues, and clothing variation jointly undermine the appearance-based assumptions of existing ReID systems. To study this regime, we introduce VReID-XFD, a video-based benchmark and community challenge for extreme far-distance (XFD) aerial-to-ground person re-identification. VReID-XFD is derived from the DetReIDX dataset and comprises 371 identities, 11,288 tracklets, and 11.75 million frames, captured across altitudes from 5.8 m to 120 m, viewing angles from oblique (30 degrees) to nadir (90 degrees), and horizontal distances up to 120 m. The benchmark supports aerial-to-aerial, aerial-to-ground, and ground-to-aerial evaluation under strict identity-disjoint splits, with rich physical metadata. The VReID-XFD-25 Challenge attracted 10 teams with hundreds of submissions. Systematic analysis reveals monotonic performance degradation with altitude and distance, a universal disadvantage of nadir views, and a trade-off between peak performance and robustness. Even the best-performing SAS-VPReID method achieves only 43.93 percent mAP in the aerial-to-ground setting. The dataset, annotations, and official evaluation protocols are publicly available at this https URL .
https://arxiv.org/abs/2601.01312
Person re-identification (ReID) is an extremely important area in both surveillance and mobile applications, requiring strong accuracy with minimal computational cost. State-of-the-art methods give good accuracy but at high computational budgets. To remedy this, this paper proposes VisNet, a computationally efficient and effective re-identification model suitable for real-world scenarios. It combines several conceptual contributions: feature fusion at multiple scales with automatic attention on each scale, semantic clustering with anatomical body partitioning, a dynamic weight averaging technique to balance classification and semantic regularization, and the FIDI loss function for improved metric learning. The multi-scale fusion combines ResNet50's stages 1 through 4 without parallel paths, while semantic clustering introduces spatial constraints through rule-based pseudo-labeling. VisNet achieves 87.05% Rank-1 and 77.65% mAP on the Market-1501 dataset with 32.41M parameters and 4.601 GFLOPs, offering a practical approach for real-time deployment in surveillance and mobile applications where computational resources are limited.
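The multi-scale fusion idea, pooling each backbone stage's feature map, projecting to a common dimension, and weighting the scales with a softmax attention, can be sketched as below; the pooling choice, random stand-in projections, and gating rule are assumptions for illustration, not VisNet's actual design.

```python
import numpy as np

# Hedged sketch: attention-weighted fusion of multi-scale feature maps, in
# the spirit of fusing ResNet50 stages 1-4. Shapes, the stand-in random
# projections, and the scoring rule are illustrative assumptions.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_multiscale(stage_feats, proj_dim=8, seed=0):
    """Pool each stage's (C, H, W) map to a vector, project all stages to a
    common dimension, then combine them with softmax attention weights."""
    rng = np.random.default_rng(seed)
    pooled = []
    for f in stage_feats:
        v = f.mean(axis=(1, 2))                      # global average pool -> (C,)
        w = rng.standard_normal((proj_dim, v.size))  # stand-in 1x1 projection
        pooled.append(w @ v)
    pooled = np.stack(pooled)                        # (num_scales, proj_dim)
    scores = pooled.mean(axis=1)                     # stand-in attention scores
    attn = softmax(scores)                           # one weight per scale
    return (attn[:, None] * pooled).sum(axis=0)      # fused vector, (proj_dim,)

# mock stage outputs with ResNet50-like channel growth
stages = [np.ones((c, 4, 4)) for c in (16, 32, 64, 128)]
fused = fuse_multiscale(stages)
```

In a trained network the projections and attention scores would be learned parameters; the sketch only shows the data flow of pool, project, and weighted sum.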
https://arxiv.org/abs/2601.00307
Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance on all data. Its main challenge is to avoid catastrophic forgetting of old knowledge while training on new data. Existing L-ReID methods typically re-extract features for all historical gallery images for inference after each update, known as "re-indexing". However, historical gallery data often cannot be saved directly due to data privacy concerns, and re-indexing large-scale gallery images is costly. As a result, retrieval inevitably becomes incompatible between query features extracted by the updated model and gallery features extracted by the model before the update, greatly impairing re-identification performance. To tackle this issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. RFL-ReID is therefore more challenging than L-ReID: it requires continuously learning and balancing new and old knowledge from diverse streaming data while keeping the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework that continuously updates the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. We verify the proposed Bi-C2R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that it achieves leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.
https://arxiv.org/abs/2512.25000
3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms for 3D HPE, such as general DA and source-free DA, overlook the issue of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source domain or any of the previous target domains. Lifelong DA for 3D HPE involves overcoming challenges in adapting to current domain poses while preserving knowledge from previous domains, particularly combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance adaptation to the current domain and alleviate catastrophic forgetting on previous domains. Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.
https://arxiv.org/abs/2512.23860
State-of-the-art person re-identification methods achieve impressive accuracy but remain largely opaque, leaving open the question: which high-level semantic attributes do these models actually rely on? We propose MoSAIC-ReID, a Mixture-of-Experts framework that systematically quantifies the importance of pedestrian attributes for re-identification. Our approach uses LoRA-based experts, each linked to a single attribute, and an oracle router that enables controlled attribution analysis. While MoSAIC-ReID achieves competitive performance on Market-1501 and DukeMTMC under the assumption that attribute annotations are available at test time, its primary value lies in providing a large-scale, quantitative study of attribute importance across intrinsic and extrinsic cues. Using generalized linear models, statistical tests, and feature-importance analyses, we reveal which attributes, such as clothing colors and intrinsic characteristics, contribute most strongly, while infrequent cues (e.g. accessories) have limited effect. This work offers a principled framework for interpretable ReID and highlights the requirements for integrating explicit semantic knowledge in practice. Code is available at this https URL
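The coefficient-based attribute-importance analysis can be illustrated with a least-squares stand-in for the paper's generalized linear models; the attribute names, synthetic data, and ranking rule below are all assumptions for the sketch, not the paper's actual study.

```python
import numpy as np

# Hedged sketch: ranking attribute importance by coefficient magnitude of a
# least-squares linear model -- a simplified stand-in for a GLM analysis.
# Attribute names and the synthetic data are illustrative.

rng = np.random.default_rng(42)
attributes = ["upper_color", "lower_color", "backpack", "hat"]

# Synthetic design matrix: 200 samples, 4 binary attribute indicators.
X = rng.integers(0, 2, size=(200, 4)).astype(float)
# Synthetic target: a matching score driven mostly by the color attributes.
y = 0.8 * X[:, 0] + 0.6 * X[:, 1] + 0.05 * X[:, 2] + rng.normal(0, 0.01, 200)

# Fit with an intercept column; inspect the attribute coefficients.
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
ranking = sorted(zip(attributes, np.abs(coef[:4])), key=lambda t: -t[1])
```

On data constructed this way the color attributes dominate the ranking, mirroring the kind of conclusion the paper draws (colors and intrinsic characteristics strong, infrequent accessories weak); a real analysis would add the statistical tests the abstract mentions.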
https://arxiv.org/abs/2512.08697
Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match individuals across visible and infrared cameras without relying on any annotation. Given the significant gap across visible and infrared modality, estimating reliable cross-modality association becomes a major challenge in USVI-ReID. Existing methods usually adopt optimal transport to associate the intra-modality clusters, which is prone to propagating the local cluster errors, and also overlooks global instance-level relations. By mining and attending to the visible-infrared modality bias, this paper focuses on addressing cross-modality learning from two aspects: bias-mitigated global association and modality-invariant representation learning. Motivated by the camera-aware distance rectification in single-modality re-ID, we propose modality-aware Jaccard distance to mitigate the distance bias caused by modality discrepancy, so that more reliable cross-modality associations can be estimated through global clustering. To further improve cross-modality representation learning, a `split-and-contrast' strategy is designed to obtain modality-specific global prototypes. By explicitly aligning these prototypes under global association guidance, modality-invariant yet ID-discriminative representation learning can be achieved. While conceptually simple, our method obtains state-of-the-art performance on benchmark VI-ReID datasets and outperforms existing methods by a significant margin, validating its effectiveness.
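The building block that the paper extends, a Jaccard distance over k-nearest-neighbor sets, can be sketched as below; the modality-aware rectification of the distance bias is not reproduced, and `k` and the toy features are illustrative.

```python
import numpy as np

# Hedged sketch: plain Jaccard distance over k-nearest-neighbor index sets,
# the base notion behind the paper's modality-aware variant (which further
# corrects modality-induced distance bias -- not shown here).

def knn_index_sets(features, k):
    """For each sample, the index set of its k nearest neighbors
    (by Euclidean distance, including the sample itself)."""
    dist = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    return [set(np.argsort(row)[:k]) for row in dist]

def jaccard_distance_matrix(features, k=2):
    """Pairwise Jaccard distance between neighbor sets: samples whose
    neighborhoods overlap heavily are treated as close."""
    neighbors = knn_index_sets(features, k)
    n = len(features)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            inter = len(neighbors[i] & neighbors[j])
            union = len(neighbors[i] | neighbors[j])
            out[i, j] = 1.0 - inter / union
    return out

feats = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
D = jaccard_distance_matrix(feats, k=2)  # two tight pairs -> block structure
```

A modality-aware version would, roughly, compute or rectify the underlying distances per modality before forming the neighbor sets, so that cross-modality pairs are not systematically pushed apart; that correction is the paper's contribution and is only named here.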
https://arxiv.org/abs/2512.07760
Visible-Infrared Person Re-Identification (VI-ReID) is a challenging cross-modal matching task due to significant modality discrepancies. While current methods mainly learn modality-invariant features through unified embedding spaces, they often attend solely to the common discriminative semantics across modalities while disregarding the critical role of modality-specific identity-aware knowledge in discriminative feature learning. To bridge this gap, we propose a novel Identity Clue Refinement and Enhancement (ICRE) network to mine and utilize the implicit discriminative knowledge inherent in modality-specific attributes. Initially, we design a Multi-Perception Feature Refinement (MPFR) module that aggregates shallow features from shared branches, aiming to capture modality-specific attributes that are easily overlooked. Then, we propose a Semantic Distillation Cascade Enhancement (SDCE) module, which distills identity-aware knowledge from the aggregated shallow features and guides the learning of modality-invariant features. Finally, an Identity Clues Guided (ICG) loss is proposed to alleviate the modality discrepancies within the enhanced features and promote the learning of a diverse representation space. Extensive experiments across multiple public datasets clearly show that our proposed ICRE outperforms existing SOTA methods.
https://arxiv.org/abs/2512.04522
The two-stage learning pipeline has achieved promising results in unsupervised visible-infrared person re-identification (USL-VI-ReID): it first performs single-modality learning and then operates cross-modality learning to tackle the modality discrepancy. Although promising, this pipeline inevitably introduces modality bias: modality-specific cues learned during single-modality training naturally propagate into the subsequent cross-modality learning, impairing identity discrimination and generalization. To address this issue, we propose a Dual-level Modality Debiasing Learning (DMDL) framework that implements debiasing at both the model and optimization levels. At the model level, we propose a Causality-inspired Adjustment Intervention (CAI) module that replaces likelihood-based modeling with causal modeling, preventing modality-induced spurious patterns from being introduced and leading to a low-biased model. At the optimization level, a Collaborative Bias-free Training (CBT) strategy is introduced to interrupt the propagation of modality bias across data, labels, and features by integrating modality-specific augmentation, label refinement, and feature alignment. Extensive experiments on benchmark datasets demonstrate that DMDL enables modality-invariant feature learning and a more generalized model.
https://arxiv.org/abs/2512.03745
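The label-refinement ingredient of DMDL's Collaborative Bias-free Training could be approximated by a cross-view consensus filter: a pseudo-label is kept only if independently augmented views of the sample agree on it. This is a simplified NumPy sketch of that general idea, not the paper's actual CBT strategy; the nearest-centroid assignment and the `-1` "excluded" convention are assumptions.

```python
import numpy as np

def assign(feats, centroids):
    """Nearest-centroid pseudo-labels under cosine similarity."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return np.argmax(f @ c.T, axis=1)

def refine_labels(view_a, view_b, centroids, noisy_labels):
    """Keep a pseudo-label only when both modality-specific augmented
    views agree with it; mark the rest -1 (excluded from training),
    interrupting the propagation of modality bias through noisy labels."""
    la, lb = assign(view_a, centroids), assign(view_b, centroids)
    keep = (la == lb) & (la == noisy_labels)
    return np.where(keep, noisy_labels, -1)
```

Samples whose label survives the consensus check contribute to the next training round; the rest are held out until their assignments stabilize.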
Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored pedestrian generation for data augmentation, they often fail to ensure identity consistency and suffer from insufficient controllability, thereby limiting their effectiveness in dataset augmentation. To address this, we introduce OmniPerson, the first unified identity-preserving pedestrian generation pipeline for visible/infrared image/video ReID tasks. Our contributions are threefold: 1) We propose OmniPerson, a unified generation model offering holistic, fine-grained control over all key pedestrian attributes. It supports RGB/IR image and video generation conditioned on any number of reference images, two kinds of person poses, and text, and also provides RGB-to-IR transfer and image super-resolution. 2) We design the Multi-Refer Fuser for robust identity preservation with any number of reference images as input, enabling OmniPerson to distill a unified identity from a set of multi-view reference images and ensuring high-fidelity, identity-consistent generation. 3) We introduce PersonSyn, the first large-scale dataset for multi-reference, controllable pedestrian generation, together with an automated curation pipeline that transforms public, ID-only ReID benchmarks into a richly annotated resource with the dense, multi-modal supervision required for this task. Experimental results demonstrate that OmniPerson achieves SoTA in pedestrian generation, excelling in both visual fidelity and identity consistency. Furthermore, augmenting existing datasets with our generated data consistently improves the performance of ReID models. We will open-source the full codebase, pretrained model, and the PersonSyn dataset.
https://arxiv.org/abs/2512.02554
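The core behavior the OmniPerson abstract attributes to its Multi-Refer Fuser — distilling one identity vector from any number of multi-view reference images — can be illustrated with simple attention pooling over a variable-length set of reference embeddings. A minimal NumPy sketch; the learned query token and dot-product attention are assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_references(ref_embs, query_token):
    """Pool any number of reference embeddings (N, D) into a single
    unified identity vector (D,) via attention onto a learned query,
    so N can vary freely between calls."""
    scores = ref_embs @ query_token / np.sqrt(ref_embs.shape[1])
    weights = softmax(scores)
    return weights @ ref_embs
```

Because the attention weights are computed per call, the same fuser handles one reference image or dozens without architectural changes.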
Generalizable person re-identification (Re-ID) aims to recognize individuals across unseen cameras and environments. While existing methods rely heavily on limited labeled multi-camera data, we propose DynaMix, a novel method that effectively combines manually labeled multi-camera and large-scale pseudo-labeled single-camera data. Unlike prior works, DynaMix dynamically adapts to the structure and noise of the training data through three core components: (1) a Relabeling Module that refines pseudo-labels of single-camera identities on-the-fly; (2) an Efficient Centroids Module that maintains robust identity representations under a large identity space; and (3) a Data Sampling Module that carefully composes mixed data mini-batches to balance learning complexity and intra-batch diversity. All components are specifically designed to operate efficiently at scale, enabling effective training on millions of images and hundreds of thousands of identities. Extensive experiments demonstrate that DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID.
https://arxiv.org/abs/2511.19067
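The scalability claim behind DynaMix's Efficient Centroids Module — robust identity representations under hundreds of thousands of identities — suggests a sparsely updated, momentum-based centroid bank: only identities present in the current mini-batch are touched. This is a hedged NumPy sketch of that general pattern, not the paper's implementation; the class name, momentum value, and normalization scheme are assumptions.

```python
import numpy as np

class EfficientCentroids:
    """Momentum-updated identity centroids. Only identities that appear
    in a mini-batch are updated, so per-step cost is independent of the
    total identity count, even with hundreds of thousands of identities."""

    def __init__(self, dim, momentum=0.9):
        self.dim, self.m = dim, momentum
        self.bank = {}  # identity id -> unit-norm centroid vector

    def update(self, feats, ids):
        for f, i in zip(feats, ids):
            f = f / (np.linalg.norm(f) + 1e-12)
            if i not in self.bank:
                self.bank[i] = f  # first sighting seeds the centroid
            else:
                c = self.m * self.bank[i] + (1 - self.m) * f
                self.bank[i] = c / (np.linalg.norm(c) + 1e-12)
```

The dictionary-backed bank grows lazily as pseudo-labeled single-camera identities appear, avoiding a dense (num_identities × dim) matrix.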
Sketch-based person re-identification aims to match hand-drawn sketches with RGB surveillance images, but remains challenging due to significant modality gaps and limited annotated data. To address this, we introduce KTCAA, a theoretically grounded framework for few-shot cross-modal generalization. Motivated by generalization theory, we identify two key factors influencing target domain risk: (1) domain discrepancy, which quantifies the alignment difficulty between source and target distributions; and (2) perturbation invariance, which evaluates the model's robustness to modality shifts. Based on these insights, we propose two components: (1) Alignment Augmentation (AA), which applies localized sketch-style transformations to simulate target distributions and facilitate progressive alignment; and (2) Knowledge Transfer Catalyst (KTC), which enhances invariance by introducing worst-case perturbations and enforcing consistency. These modules are jointly optimized under a meta-learning paradigm that transfers alignment knowledge from data-rich RGB domains to sketch-based scenarios. Experiments on multiple benchmarks demonstrate that KTCAA achieves state-of-the-art performance, particularly in data-scarce conditions.
https://arxiv.org/abs/2511.18677
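The KTC idea in the KTCAA abstract — introducing worst-case perturbations and enforcing consistency — can be illustrated with a cheap random-search stand-in: sample a few perturbations inside an eps-ball, keep the one that moves the embedding most, and penalize that divergence. This NumPy sketch replaces the paper's (presumably gradient-based) worst-case search with random sampling; the function name and all parameters are assumptions.

```python
import numpy as np

def consistency_loss(model, x, eps=0.1, n_trials=8, rng=None):
    """Approximate worst-case consistency: among n_trials random
    perturbations bounded by eps, penalize the largest embedding
    divergence, encouraging invariance to modality-style shifts."""
    rng = rng or np.random.default_rng(0)
    base = model(x)
    worst = 0.0
    for _ in range(n_trials):
        delta = rng.uniform(-eps, eps, size=x.shape)
        d = float(np.mean((model(x + delta) - base) ** 2))
        worst = max(worst, d)
    return worst
```

A gradient-ascent inner loop (as in adversarial training) would find tighter worst cases; random search is used here only to keep the sketch self-contained.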