Any-Time Person Re-identification (AT-ReID) requires robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and extensive clothing-change scenarios ranging from short-term to long-term intervals. However, existing methods rely heavily on pure visual features, which are prone to change with environmental and temporal factors, resulting in significant performance deterioration under illumination-induced modality shifts or clothing changes. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity-consistent text, providing identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures stable biometric cues to drive the semantic model. The resulting text token is used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. Meanwhile, the same text token drives Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing for more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results. Moreover, the model trained on AT-USTC was evaluated on five widely used ReID benchmarks, demonstrating superior generalization with highly competitive results. Our code will be available soon.
https://arxiv.org/abs/2604.15090
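As a rough illustration of the Semantic-driven Visual Token Filtering (SVTF) idea above, the sketch below scores visual patch tokens against the LVLM-derived text token, keeps the most relevant patches, and re-weights them. The module name, dimensions, and top-k rule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SemanticTokenFilter(nn.Module):
    """Hypothetical SVTF-style filter: a text token gates visual patch tokens."""
    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # projects the identity-intrinsic text token
        self.k = nn.Linear(dim, dim)   # projects visual patch tokens
        self.keep_ratio = keep_ratio

    def forward(self, text_tok, vis_toks):
        # text_tok: (B, D); vis_toks: (B, N, D)
        scores = torch.einsum("bd,bnd->bn", self.q(text_tok), self.k(vis_toks))
        weights = (scores / vis_toks.size(-1) ** 0.5).softmax(dim=-1)
        k = max(1, int(self.keep_ratio * vis_toks.size(1)))
        idx = weights.topk(k, dim=-1).indices          # keep top-k relevant patches
        kept = torch.gather(vis_toks, 1,
                            idx.unsqueeze(-1).expand(-1, -1, vis_toks.size(-1)))
        kept_w = torch.gather(weights, 1, idx)
        return kept * (1.0 + kept_w.unsqueeze(-1))     # amplify informative regions

out = SemanticTokenFilter(256)(torch.randn(2, 256), torch.randn(2, 197, 256))
print(out.shape)  # torch.Size([2, 137, 256])
```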
In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate spatiotemporal features, reducing computational overhead. We evaluate our approach on two standard datasets (MARS and iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID and DanceVReID). Experimental results demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements across all benchmarks.
https://arxiv.org/abs/2604.07740
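The token-based feature extraction (TFE) component lends itself to a short sketch: a fixed number of learnable tokens cross-attend over all frame patch features, so cost scales with the token count rather than the full spatiotemporal sequence. Everything below (names, sizes, the mean-pooled readout) is an illustrative assumption.

```python
import torch
import torch.nn as nn

class TokenAggregator(nn.Module):
    def __init__(self, dim: int = 512, num_tokens: int = 8, heads: int = 8):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        # feats: (B, T*N, D) flattened spatiotemporal features of a video clip
        q = self.tokens.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, feats, feats)   # cross-attention: tokens as queries
        out = self.norm(out + q)
        return out.mean(dim=1)                # (B, D) clip-level descriptor

clip_feats = torch.randn(4, 8 * 196, 512)     # 8 frames x 196 patches
print(TokenAggregator()(clip_feats).shape)    # torch.Size([4, 512])
```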
Extreme far-distance video person re-identification (ReID) is particularly challenging due to scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch. As camera altitude and subject distance increase, models trained on close-range imagery degrade significantly. In this work, we investigate how large-scale vision-language models can be adapted to operate reliably under these conditions. Starting from a CLIP-based baseline, we upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. To address noisy and low-resolution tracklets, we incorporate a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. We retain adapter-based and prompt-conditioned cross-view learning to mitigate aerial-ground domain shifts, and further refine retrieval using improved optimization and k-reciprocal re-ranking. Experiments on the DetReIDX stress-test benchmark show that our approach achieves mAP scores of 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A), corresponding to an overall mAP of 35.73. These results show that large-scale vision-language backbones, when combined with stability-focused adaptation, significantly enhance robustness in extreme far-distance video person ReID.
https://arxiv.org/abs/2604.04183
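A minimal sketch of the lightweight temporal attention pooling described above: a small scoring head assigns each frame a weight, so blurred or degraded frames contribute little to the tracklet descriptor. The scoring MLP and its sizes are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TemporalAttnPool(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                   nn.Linear(dim // 4, 1))

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) per-frame embeddings of a tracklet
        w = self.score(frame_feats).softmax(dim=1)   # (B, T, 1) frame weights
        return (w * frame_feats).sum(dim=1)          # degraded frames get low weight

tracklet = torch.randn(2, 16, 768)
print(TemporalAttnPool()(tracklet).shape)            # torch.Size([2, 768])
```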
Person Re-Identification (ReID) faces severe challenges from modality discrepancy and clothing variation in long-term surveillance scenarios. While existing studies have made significant progress in either Visible-Infrared ReID (VI-ReID) or Clothing-Change ReID (CC-ReID), real-world surveillance systems often face both challenges simultaneously. To address this overlooked yet realistic problem, we define a new task, termed Cross-Modality Clothing-Change Re-Identification (CMCC-ReID), which targets pedestrian matching across variations in both modality and clothing. To advance research in this direction, we construct a new benchmark, SYSU-CMCC, where each identity is captured in both visible and infrared domains with distinct outfits, reflecting the dual heterogeneity of long-term surveillance. To tackle CMCC-ReID, we propose a Progressive Identity Alignment Network (PIA) that progressively mitigates clothing variation and modality discrepancy. Specifically, a Dual-Branch Disentangling Learning (DBDL) module separates identity-related cues from clothing-related factors to achieve clothing-agnostic representation, and a Bi-Directional Prototype Learning (BPL) module performs intra-modality and inter-modality contrast in the embedding space to bridge the modality gap while further suppressing clothing interference. Extensive experiments on the SYSU-CMCC dataset demonstrate that PIA establishes a strong baseline for this new task and significantly outperforms existing methods.
https://arxiv.org/abs/2604.02808
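The Bi-Directional Prototype Learning (BPL) module can be illustrated with a hedged sketch: each embedding is contrasted against per-identity prototypes of both the visible and infrared modalities, so a feature must match its identity in both domains. The loss form, normalization, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def bpl_loss(feats, labels, protos_vis, protos_ir, tau=0.1):
    # feats: (B, D) embeddings; labels: (B,) identity indices
    # protos_vis / protos_ir: (C, D) per-identity prototypes of each modality
    f = F.normalize(feats, dim=1)
    pv = F.normalize(protos_vis, dim=1)
    pi = F.normalize(protos_ir, dim=1)
    logits_v = f @ pv.t() / tau   # similarity to visible prototypes
    logits_i = f @ pi.t() / tau   # similarity to infrared prototypes
    # intra- and inter-modality contrast: matching both prototype sets
    # bridges the modality gap while keeping identity discrimination
    return F.cross_entropy(logits_v, labels) + F.cross_entropy(logits_i, labels)

feats = torch.randn(8, 256)
labels = torch.randint(0, 100, (8,))
print(bpl_loss(feats, labels, torch.randn(100, 256), torch.randn(100, 256)))
```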
Federated learning (FL) has emerged as a promising paradigm for privacy-preserving multi-camera video understanding. However, applying FL to cross-view scenarios faces three major challenges: (i) heterogeneous viewpoints and backgrounds lead to highly non-IID client distributions and overfitting to view-specific patterns, (ii) local distribution biases cause misaligned representations that hinder consistent cross-view semantics, and (iii) large video architectures incur prohibitive communication overhead. To address these issues, we propose FedCVU, a federated framework with three components: VS-Norm, which preserves normalization parameters to handle view-specific statistics; CV-Align, a lightweight contrastive regularization module to improve cross-view representation alignment; and SLA, a selective layer aggregation strategy that reduces communication without sacrificing accuracy. Extensive experiments on action understanding and person re-identification tasks under a cross-view protocol demonstrate that FedCVU consistently boosts unseen-view accuracy while maintaining strong seen-view performance, outperforming state-of-the-art FL baselines and showing robustness to domain heterogeneity and communication constraints.
https://arxiv.org/abs/2603.21647
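As a rough sketch of how VS-Norm and SLA could combine at aggregation time, the function below averages client weights while (i) leaving normalization parameters local to each client and (ii) aggregating only a selected subset of layers. The name-matching heuristics and the chosen prefixes are assumptions.

```python
import torch

def selective_aggregate(client_states,
                        shared_prefixes=("blocks.10", "blocks.11", "head")):
    """client_states: list of state_dicts; returns the averaged shared update."""
    avg = {}
    for name in client_states[0]:
        is_norm = "norm" in name or "bn" in name          # VS-Norm: stays local
        is_shared = any(name.startswith(p) for p in shared_prefixes)  # SLA subset
        if is_shared and not is_norm:
            avg[name] = torch.stack(
                [s[name].float() for s in client_states]).mean(0)
    return avg

# each client then applies the partial update:
# client_model.load_state_dict(selective_aggregate(states), strict=False)
```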
Recent vision-language models such as CLIP provide strong cross-modal alignment, but current CLIP-guided ReID pipelines rely on global features and fixed prompts. This limits their ability to capture fine-grained attribute cues and adapt to diverse appearances. We propose ALADIN, an attribute-language distillation network that distills knowledge from a frozen CLIP teacher to a lightweight ReID student. ALADIN introduces fine-grained attribute-local alignment to establish adaptive text-visual correspondence and robust representation learning. A Scene-Aware Prompt Generator produces image-specific soft prompts to facilitate adaptive alignment. Attribute-local distillation enforces consistency between textual attributes and local visual features, significantly enhancing robustness under occlusions. Furthermore, we employ cross-modal contrastive and relation distillation to preserve the inherent structural relationships among attributes. To provide precise supervision, we leverage Multimodal LLMs to generate structured attribute descriptions, which are then converted into localized attention maps via CLIP. At inference, only the student is used. Experiments on Market-1501, DukeMTMC-reID, and MSMT17 show improvements over CNN-, Transformer-, and CLIP-based methods, with better generalization and interpretability.
https://arxiv.org/abs/2603.21482
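A hedged sketch of attribute-local distillation: frozen CLIP attribute text embeddings induce per-attribute attention maps over local visual tokens, and the student's maps are aligned with the teacher's. Shapes and the KL form of the consistency loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attribute_local_distill(student_local, teacher_local, attr_text):
    # student_local / teacher_local: (B, N, D) local visual tokens over N patches
    # attr_text: (A, D) frozen CLIP text embeddings of structured attributes
    a = F.normalize(attr_text, dim=-1)
    s = F.normalize(student_local, dim=-1)
    t = F.normalize(teacher_local, dim=-1)
    logit_s = torch.einsum("bnd,ad->ban", s, a)   # per-attribute patch affinities
    logit_t = torch.einsum("bnd,ad->ban", t, a)
    # student's attribute localization should match the frozen teacher's
    return F.kl_div(F.log_softmax(logit_s, dim=-1),
                    F.softmax(logit_t, dim=-1), reduction="batchmean")

loss = attribute_local_distill(torch.randn(2, 196, 512),
                               torch.randn(2, 196, 512), torch.randn(12, 512))
print(loss)
```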
Lifelong person re-identification (LReID) aims to learn from varying domains to obtain a unified person retrieval model. Existing LReID approaches typically learn from scratch or from a visual-classification-pretrained model, whereas the Vision-Language Model (VLM) has shown generalizable knowledge across a variety of tasks. Although existing methods can be directly adapted to a VLM, they consider only global-aware learning, so fine-grained attribute knowledge is underexploited, limiting both knowledge acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). Our key idea is to explicitly model universally shared human attributes to improve inter-domain knowledge transfer, thereby effectively using historical knowledge to reinforce new knowledge learning and alleviate forgetting. Specifically, VLADR includes a Multi-grain Text Attribute Disentanglement mechanism that mines the global and diverse local text attributes of an image. An Inter-domain Cross-modal Attribute Reinforcement scheme is then developed, which introduces cross-modal attribute alignment to guide visual attribute extraction and adopts inter-domain attribute alignment to achieve fine-grained knowledge transfer. Experimental results demonstrate that VLADR outperforms state-of-the-art methods by 1.9%-2.2% and 2.1%-2.5% in anti-forgetting and generalization capacity, respectively. Our source code is available at this https URL
https://arxiv.org/abs/2603.19678
Visible-infrared person re-identification faces greater challenges than traditional person re-identification due to the significant differences between modalities. In particular, these modality differences make effective matching even harder, mainly because existing re-ranking algorithms cannot simultaneously address the intra-modal variations and inter-modal discrepancy in cross-modal person re-identification. To address this problem, we propose a novel progressive modal-relationship re-ranking method consisting of two modules, called Heterogeneous and Homogeneous Consistency Re-ranking (HHCR). The first module, heterogeneous consistency re-ranking, explores the relationship between the query and gallery modalities in the test set. The second module, homogeneous consistency re-ranking, investigates the intrinsic relationship within each modality between the query and the gallery in the test set. Based on this, we propose a baseline for cross-modal person re-identification, called the Consistency Re-ranking Inference network (CRI). Comprehensive experiments demonstrate that the proposed re-ranking method generalizes well, and both the re-ranking and the baseline achieve state-of-the-art performance.
https://arxiv.org/abs/2603.16165
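In the spirit of the heterogeneous/homogeneous split above, a toy re-ranking sketch: raw cross-modal query-gallery distances (the heterogeneous relation) are smoothed using each gallery sample's intra-modality neighbors (the homogeneous relation). The neighborhood size and mixing weight are assumptions, not the HHCR algorithm itself.

```python
import numpy as np

def consistency_rerank(d_qg, d_gg, k=10, lam=0.3):
    # d_qg: (Q, G) query->gallery cross-modal distances (heterogeneous)
    # d_gg: (G, G) gallery->gallery intra-modal distances (homogeneous)
    nbrs = np.argsort(d_gg, axis=1)[:, :k]     # each gallery sample's neighbors
    smoothed = d_qg[:, nbrs].mean(axis=2)      # (Q, G) neighbor-averaged distance
    return (1 - lam) * d_qg + lam * smoothed

d_qg = np.random.rand(5, 100)
d_gg = np.random.rand(100, 100)
print(consistency_rerank(d_qg, d_gg).shape)    # (5, 100)
```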
Visible-Infrared Person Re-Identification (VI-ReID) is a challenging retrieval task due to the substantial modality gap between visible and infrared images. While existing methods attempt to bridge this gap by learning modality-invariant features within a shared embedding space, they often overlook the complex and implicit correlations between modalities. This limitation becomes more severe under distribution shifts, where infrared samples are often far fewer than visible ones. To address these challenges, we propose a novel network termed Bi-directional Interaction Transformation (BIT). Instead of relying on rigid feature alignment, BIT adopts a matching-based strategy that explicitly models the interaction between visible and infrared image pairs. Specifically, BIT employs an encoder-decoder architecture where the encoder extracts preliminary feature representations, and the decoder performs bi-directional feature integration and query-aware scoring to enhance cross-modality correspondence. To the best of our knowledge, BIT is the first to introduce such pairwise matching-driven interaction in VI-ReID. Extensive experiments on several benchmarks demonstrate that our BIT achieves state-of-the-art performance, highlighting its effectiveness in the VI-ReID task.
https://arxiv.org/abs/2603.14243
Domain Generalized person Re-identification (DG Re-ID) is a challenging task in which models are trained on source domains but tested on unseen target domains. Although previous pure vision-based models have achieved significant progress, performance still leaves room for improvement. Recently, Vision-Language Models (VLMs) have shown outstanding generalization capabilities in various visual applications. However, directly adapting a VLM to Re-ID yields limited generalization improvement, because the VLM produces only global features that are insensitive to identity nuances. To tackle this problem, we propose a CLIP-based multi-grained vision-language alignment framework. Specifically, several multi-grained prompts are introduced in the language modality to describe different body parts and align with their counterparts in the vision modality. To obtain fine-grained visual information, an adaptively masked multi-head self-attention module precisely extracts specific part features. To train the proposed module, an MLLM-based visual grounding expert automatically generates pseudo labels of body parts for supervision. Extensive experiments under both single- and multi-source generalization protocols demonstrate the superior performance of our approach. The implementation code will be released at this https URL.
https://arxiv.org/abs/2603.14012
Federated Domain Generalization for Person Re-Identification (FedDG-ReID) learns domain-invariant representations from decentralized data. While Vision Transformer (ViT) is widely adopted, its global attention often fails to distinguish pedestrians from highly similar backgrounds or diverse viewpoints -- a challenge amplified by cross-client distribution shifts in FedDG-ReID. To address this, we propose Federated Body Distribution Aware Visual Prompt (FedBPrompt), introducing learnable visual prompts to guide Transformer attention toward pedestrian-centric regions. FedBPrompt employs a Body Distribution Aware Visual Prompts Mechanism (BAPM) comprising: Holistic Full Body Prompts to suppress cross-client background noise, and Body Part Alignment Prompts to capture fine-grained details robust to pose and viewpoint variations. To mitigate high communication costs, we design a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and updates only lightweight prompts, significantly reducing communication overhead while maintaining adaptability. Extensive experiments demonstrate that BAPM effectively enhances feature discrimination and cross-domain generalization, while PFTS achieves notable performance gains within only a few aggregation rounds. Moreover, both BAPM and PFTS can be easily integrated into existing ViT-based FedDG-ReID frameworks, making FedBPrompt a flexible and effective solution for federated person re-identification. The code is available at this https URL.
https://arxiv.org/abs/2603.12912
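Prompt-based fine-tuning reduces to freezing the backbone and updating only a small prompt tensor, which is also all that needs to be communicated each round. The sketch below wires prompts into a generic token encoder; the injection point and sizes are assumptions.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int = 768, n_prompts: int = 10):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                 # frozen: never communicated
        self.prompts = nn.Parameter(torch.zeros(1, n_prompts, dim))

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D); prepend learnable prompts, run frozen encoder
        p = self.prompts.expand(patch_tokens.size(0), -1, -1)
        return self.backbone(torch.cat([p, patch_tokens], dim=1))

    def trainable_state(self):
        # the only tensor a client sends to the server each round
        return {"prompts": self.prompts.detach().cpu()}

enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, 8, batch_first=True), num_layers=2)
model = PromptedEncoder(enc)
print(model(torch.randn(2, 196, 768)).shape)        # torch.Size([2, 206, 768])
```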
The application of federated domain generalization in person re-identification (FedDG-ReID) aims to enhance the model's generalization ability in unseen domains while protecting client data privacy. However, existing mainstream methods typically rely on global feature representations and simple averaging operations for model aggregation, leading to two limitations in domain generalization: (1) Using only global features makes it difficult to capture subtle, domain-invariant local details (such as accessories or textures); (2) Uniform parameter averaging treats all clients as equivalent, ignoring their differences in robust feature extraction capabilities, thereby diluting the contributions of high-quality clients. To address these issues, we propose a novel federated learning framework, Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration (FedARKS), comprising two mechanisms: RK (Robust Knowledge) and KS (Knowledge Selection).
https://arxiv.org/abs/2603.06122
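Though the abstract stops short of details, knowledge selection suggests replacing uniform averaging with quality-weighted aggregation. The sketch below weights each client's update by a quality score (e.g., local validation mAP); this weighting rule is an assumption, not FedARKS itself.

```python
import torch

def weighted_aggregate(client_states, quality):
    # client_states: list of K state_dicts; quality: K scores, e.g. validation mAP
    w = torch.tensor(quality, dtype=torch.float)
    w = w / w.sum()                                  # normalized contributions
    return {name: sum(wi * s[name].float() for wi, s in zip(w, client_states))
            for name in client_states[0]}

states = [{"fc.weight": torch.randn(4, 4)} for _ in range(3)]
print(weighted_aggregate(states, [0.8, 0.5, 0.6])["fc.weight"].shape)
```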
Cloth-Changing Person Re-Identification (CC-ReID) aims to match the same individual across cameras under varying clothing conditions. Existing approaches often remove apparel and focus on the head region to reduce clothing bias. However, treating the head holistically without distinguishing between face and hair leads to over-reliance on volatile hairstyle cues, causing performance degradation under hairstyle changes. To address this issue, we propose the Mitigating Hairstyle Distraction and Structural Preservation (MSP) framework. Specifically, MSP introduces Hairstyle-Oriented Augmentation (HSOA), which generates intra-identity hairstyle diversity to reduce hairstyle dependence and enhance attention to stable facial and body cues. To prevent the loss of structural information, we design Cloth-Preserved Random Erasing (CPRE), which performs ratio-controlled erasing within clothing regions to suppress texture bias while retaining body shape and context. Furthermore, we employ Region-based Parsing Attention (RPA) to incorporate parsing-guided priors that highlight face and limb regions while suppressing hair features. Extensive experiments on multiple CC-ReID benchmarks demonstrate that MSP achieves state-of-the-art performance, providing a robust and practical solution for long-term person re-identification.
https://arxiv.org/abs/2603.01640
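Cloth-Preserved Random Erasing (CPRE) can be sketched as ratio-controlled erasing restricted to clothing pixels given a parsing mask, leaving body contours and context intact. The patch sampling, fill value, and iteration bound below are illustrative assumptions.

```python
import torch

def cloth_preserved_erase(img, cloth_mask, ratio=0.3, patch=16):
    # img: (C, H, W) float tensor; cloth_mask: (H, W) bool, True on clothing
    out, done = img.clone(), torch.zeros_like(cloth_mask)
    ys, xs = torch.nonzero(cloth_mask, as_tuple=True)
    target = int(ratio * cloth_mask.sum())
    for _ in range(200):                              # safety bound
        if done.sum() >= target or len(ys) == 0:
            break
        i = torch.randint(len(ys), (1,)).item()       # random clothing anchor
        y0, x0 = ys[i].item(), xs[i].item()
        region = cloth_mask[y0:y0+patch, x0:x0+patch]  # clothing pixels only
        out[:, y0:y0+patch, x0:x0+patch][:, region] = 0.0
        done[y0:y0+patch, x0:x0+patch] |= region      # track unique erased pixels
    return out

img, mask = torch.rand(3, 256, 128), torch.zeros(256, 128, dtype=torch.bool)
mask[80:180, 20:100] = True                           # pretend parsing-mask region
erased = cloth_preserved_erase(img, mask)
```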
With the increasing demand for robust person Re-ID in unconstrained environments, learning from datasets with noisy labels and sparse per-identity samples remains a critical challenge. Existing noise-robust person Re-ID methods primarily rely on loss-correction or sample-selection strategies using softmax outputs. However, these methods suffer from two key limitations: 1) Softmax exhibits translation invariance, leading to over-confident and unreliable predictions on corrupted labels. 2) Conventional sample selection based on small-loss criteria often discards valuable hard positives that are crucial for learning discriminative features. To overcome these issues, we propose the CAlibration-to-REfinement (CARE) method, a two-stage framework that seeks certainty through probabilistic evidence propagation from calibration to refinement. In the calibration stage, we propose probabilistic evidence calibration (PEC), which dismantles softmax translation invariance by injecting adaptive learnable parameters into the similarity function, and employs an evidential calibration loss to mitigate overconfidence on mislabeled samples. In the refinement stage, we design evidence propagation refinement (EPR), which more accurately distinguishes between clean and noisy samples. Specifically, EPR contains two steps: first, the composite angular margin (CAM) metric is proposed to precisely distinguish clean but hard-to-learn positive samples from mislabeled ones in a hyperspherical space; second, certainty-oriented sphere weighting (COSW) is developed to dynamically allocate the importance of samples according to CAM, ensuring clean instances drive model updates. Extensive experimental results on the Market1501, DukeMTMC-ReID, and CUHK03 datasets under both random and patterned label noise show that CARE achieves competitive performance.
https://arxiv.org/abs/2602.23133
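The calibration idea behind PEC, injecting adaptive learnable parameters into the similarity function to temper over-confident predictions, can be sketched as a cosine head with a learnable scale and per-class bias, so the logits are no longer invariant to a fixed-temperature shift. The exact parameterization is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CalibratedSimilarity(nn.Module):
    def __init__(self, dim: int, num_ids: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_ids, dim))
        self.scale = nn.Parameter(torch.tensor(10.0))   # learnable temperature
        self.bias = nn.Parameter(torch.zeros(num_ids))  # learnable per-class shift

    def forward(self, feats):
        cos = F.normalize(feats, dim=1) @ F.normalize(self.weight, dim=1).t()
        return self.scale * cos + self.bias             # calibrated logits

head = CalibratedSimilarity(256, 751)
print(head(torch.randn(8, 256)).shape)                  # torch.Size([8, 751])
```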
City-scale person re-identification across distributed cameras must handle severe appearance changes from viewpoint, occlusion, and domain shift while complying with data protection rules that prevent sharing raw imagery. We introduce CityGuard, a topology-aware transformer for privacy-preserving identity retrieval in decentralized surveillance. The framework integrates three components. A dispersion-adaptive metric learner adjusts instance-level margins according to feature spread, increasing intra-class compactness. Spatially conditioned attention injects coarse geometry, such as GPS or deployment floor plans, into graph-based self-attention to enable projectively consistent cross-view alignment using only coarse geometric priors without requiring survey-grade calibration. Differentially private embedding maps are coupled with compact approximate indexes to support secure and cost-efficient deployment. Together these designs produce descriptors robust to viewpoint variation, occlusion, and domain shifts, and they enable a tunable balance between privacy and utility under rigorous differential-privacy accounting. Experiments on Market-1501 and additional public benchmarks, complemented by database-scale retrieval studies, show consistent gains in retrieval precision and query throughput over strong baselines, confirming the practicality of the framework for privacy-critical urban identity matching.
https://arxiv.org/abs/2602.18047
Generalizable image-based person re-identification (Re-ID) aims to recognize individuals across cameras in unseen domains without retraining. While multiple existing approaches address the domain gap through complex architectures, recent findings indicate that better generalization can be achieved by stylistically diverse single-camera data. Although this data is easy to collect, it lacks complexity due to minimal cross-view variation. We propose ReText, a novel method trained on a mixture of multi-camera Re-ID data and single-camera data, where the latter is complemented by textual descriptions to enrich semantic cues. During training, ReText jointly optimizes three tasks: (1) Re-ID on multi-camera data, (2) image-text matching, and (3) image reconstruction guided by text on single-camera data. Experiments demonstrate that ReText achieves strong generalization and significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks. To the best of our knowledge, this is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based person Re-ID.
https://arxiv.org/abs/2602.05785
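The three-task objective above admits a compact sketch: an identity classification loss on multi-camera data, a symmetric image-text InfoNCE loss, and an L1 reconstruction loss for text-guided image reconstruction on single-camera data. The loss forms, temperature, and weights are assumptions.

```python
import torch
import torch.nn.functional as F

def retext_loss(id_logits, id_labels,     # (B, C), (B,) multi-camera ReID batch
                img_emb, txt_emb,         # (B, D) matched pairs, single-camera
                recon, target,            # (B, 3, H, W) text-guided reconstruction
                w=(1.0, 0.5, 0.5)):
    l_reid = F.cross_entropy(id_logits, id_labels)
    # symmetric InfoNCE for image-text matching
    sim = F.normalize(img_emb, dim=1) @ F.normalize(txt_emb, dim=1).t() / 0.07
    tgt = torch.arange(sim.size(0), device=sim.device)
    l_itm = 0.5 * (F.cross_entropy(sim, tgt) + F.cross_entropy(sim.t(), tgt))
    l_rec = F.l1_loss(recon, target)
    return w[0] * l_reid + w[1] * l_itm + w[2] * l_rec
```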
Both fine-grained discriminative details and global semantic features can contribute to solving person re-identification challenges such as occlusion and pose variations. Vision foundation models (e.g., DINO) excel at mining local textures, while vision-language models (e.g., CLIP) capture strong global semantic differences. Existing methods predominantly rely on a single paradigm, neglecting the potential benefits of their integration. In this paper, we analyze the complementary roles of these two architectures and propose a framework that synergizes their strengths through a Dual-Regularized Bidirectional Transformer (DRFormer). The dual-regularization mechanism ensures diverse feature extraction and achieves a better balance between the contributions of the two models. Extensive experiments on five benchmarks show that our method effectively harmonizes local and global representations, achieving competitive performance against state-of-the-art methods.
https://arxiv.org/abs/2602.01059
Aerial-ground person re-identification (AG-ReID) is fundamentally challenged by extreme viewpoint and distance discrepancies between aerial and ground cameras, which induce severe geometric distortions and invalidate the assumption of a shared similarity space across views. Existing methods primarily rely on geometry-aware feature learning or appearance-conditioned prompting, while implicitly assuming that the geometry-invariant dot-product similarity used in attention mechanisms remains reliable under large viewpoint and scale variations. We argue that this assumption does not hold. Extreme camera geometry systematically distorts the query-key similarity space and degrades attention-based matching, even when feature representations are partially aligned. To address this issue, we introduce Geometry-Induced Query-Key Transformation (GIQT), a lightweight low-rank module that explicitly rectifies the similarity space by conditioning query-key interactions on camera geometry. Rather than modifying feature representations or the attention formulation itself, GIQT adapts the similarity computation to compensate for dominant geometry-induced anisotropic distortions. Building on this local similarity rectification, we further incorporate a geometry-conditioned prompt generation mechanism that provides global, view-adaptive representation priors derived directly from camera geometry. Experiments on four aerial-ground person re-identification benchmarks demonstrate that the proposed framework consistently improves robustness under extreme and previously unseen geometric conditions, while introducing minimal computational overhead compared to state-of-the-art methods.
https://arxiv.org/abs/2601.21405
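A speculative sketch of a geometry-induced query-key transformation: camera geometry (e.g., altitude, pitch) generates a low-rank correction to the query-key similarity inside attention, rectifying the similarity space without touching the feature representations. The rank, geometry encoding, and placement are assumptions.

```python
import torch
import torch.nn as nn

class GeoQKTransform(nn.Module):
    def __init__(self, dim: int = 768, geo_dim: int = 4, rank: int = 8):
        super().__init__()
        self.to_uv = nn.Linear(geo_dim, 2 * dim * rank)  # geometry -> low-rank pair
        self.rank, self.dim = rank, dim

    def forward(self, q, k, geo):
        # q, k: (B, N, D); geo: (B, geo_dim) camera parameters
        uv = self.to_uv(geo).view(-1, 2, self.dim, self.rank)
        u, v = uv[:, 0], uv[:, 1]                        # (B, D, r) each
        # rectified similarity: q (I + U V^T / r) k^T, a low-rank QK correction
        delta = torch.einsum("bnd,bdr,ber,bme->bnm", q, u, v, k) / self.rank
        return q @ k.transpose(1, 2) + delta             # (B, N, N) attn logits

m = GeoQKTransform()
print(m(torch.randn(2, 5, 768), torch.randn(2, 5, 768),
        torch.randn(2, 4)).shape)                        # torch.Size([2, 5, 5])
```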
Person Re-Identification (ReID) remains a challenging problem in computer vision. This work reviews various training paradigms, evaluates the robustness of state-of-the-art ReID models in cross-domain applications, and examines the role of foundation models in improving generalization through richer, more transferable visual representations. We compare three training paradigms: supervised, self-supervised, and language-aligned models. The study aims to answer the following questions: Can supervised models generalize in cross-domain scenarios? How do foundation models like SigLIP2 perform on ReID tasks? What are the weaknesses of current supervised and foundation models for ReID? We conducted the analysis across 11 models and 9 datasets. Our results show a clear split: supervised models dominate their training domain but crumble on cross-domain data. Language-aligned models, however, show surprising cross-domain robustness for ReID tasks, even though they are not explicitly trained for it. Code and data available at: this https URL.
https://arxiv.org/abs/2601.20598
Unlike conventional person re-identification (ReID), clothes-changing ReID (CC-ReID) presents severe challenges due to substantial appearance variations introduced by clothing changes. In this work, we propose the Quality-Aware Dual-Branch Matching (QA-ReID), which jointly leverages RGB-based features and parsing-based representations to model both global appearance and clothing-invariant structural cues. These heterogeneous features are adaptively fused through a multi-modal attention module. At the matching stage, we further design the Quality-Aware Query Adaptive Convolution (QAConv-QA), which incorporates pixel-level importance weighting and bidirectional consistency constraints to enhance robustness against clothing variations. Extensive experiments demonstrate that QA-ReID achieves state-of-the-art performance on multiple benchmarks, including PRCC, LTCC, and VC-Clothes, and significantly outperforms existing approaches under cross-clothing scenarios.
https://arxiv.org/abs/2601.19133