Head Magnetic Resonance Imaging (MRI) is routinely collected and shared for research under strict regulatory frameworks that require removing potential identifiers before sharing. However, even after skull stripping, the brain parenchyma retains unique signatures that can be matched against other MRIs of the same participants across databases, posing a privacy risk if additional data features are available. Current regulatory frameworks often mandate evaluating such risks against a standard of reasonableness. Prior studies have suggested that a brain MRI could enable participant linkage, but they have relied on training-based or computationally intensive methods. Here, we demonstrate that linking an individual's skull-stripped T1-weighted MRIs, which may lead to re-identification if other identifiers are available, is possible using standard preprocessing followed by image similarity computation. Nearly perfect linkage accuracy was achieved in matching data samples across various time intervals, scanner types, spatial resolutions, and acquisition protocols, despite potential cognitive decline, simulating MRI matching across databases. These results aim to contribute meaningfully to the development of thoughtful, forward-looking policies for medical data sharing.
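The linkage attack described here (standard preprocessing followed by image similarity computation) can be sketched minimally as correlation matching over preprocessed intensities. The Pearson-correlation matcher and the toy 1-D "volumes" below are illustrative assumptions, not the paper's actual pipeline, which would operate on registered, intensity-normalized 3-D scans:

```python
import math

def pearson(a, b):
    """Pearson correlation between two equally sized intensity vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    na = math.sqrt(sum((x - ma) ** 2 for x in a))
    nb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (na * nb)

def link(probe, gallery):
    """Index of the gallery scan most similar to the probe scan."""
    scores = [pearson(probe, g) for g in gallery]
    return max(range(len(gallery)), key=scores.__getitem__)

# Toy 1-D "volumes": subject A scanned twice, subject B once.
scan_a_t0 = [10, 40, 35, 90, 20, 55]
scan_a_t1 = [12, 41, 33, 88, 22, 54]   # same subject, later time point
scan_b    = [80, 15, 60, 10, 70, 25]   # a different subject
```

With these toy scans, `link(scan_a_t1, [scan_a_t0, scan_b])` correctly points back at subject A, illustrating why no trained model is needed once the scans share a common space.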
https://arxiv.org/abs/2602.10043
Privacy is a human right that sustains patient-provider trust. Clinical notes capture a patient's private vulnerability and individuality, which are used for care coordination and research. Under HIPAA Safe Harbor, these notes are de-identified to protect patient privacy. However, Safe Harbor was designed for an era of categorical tabular data, focusing on the removal of explicit identifiers while ignoring the latent information found in correlations between identity and quasi-identifiers, which can be captured by modern LLMs. We first formalize these correlations using a causal graph, then validate it empirically through individual re-identification of patients from scrubbed notes. The paradox of de-identification is further shown through a diagnosis ablation: even when all other information is removed, the model can predict the patient's neighborhood based on diagnosis alone. This position paper raises the question of how we can act as a community to uphold patient-provider trust when de-identification is inherently imperfect. We aim to raise awareness and discuss actionable recommendations.
https://arxiv.org/abs/2602.08997
The rise of generative models has led to increased use of large-scale datasets collected from the internet, often with minimal or no data curation. This raises concerns about the inclusion of sensitive or private information. In this work, we explore the presence of pregnancy ultrasound images, which contain sensitive personal information and are often shared online. Through a systematic examination of the LAION-400M dataset using CLIP embedding similarity, we retrieve images containing pregnancy ultrasounds and detect thousands of instances of private information such as names and locations. Our findings reveal that multiple images contain high-risk information that could enable re-identification or impersonation. We conclude with recommended practices for dataset curation, data privacy, and ethical use of public image datasets.
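The retrieval step amounts to nearest-neighbor search in an embedding space. In this sketch, the cosine function and the hand-made 3-D "embeddings" stand in for real CLIP vectors, which are high-dimensional and produced by the actual model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_emb, image_embs, threshold=0.9):
    """Indices of gallery images whose embedding matches the query concept."""
    return [i for i, e in enumerate(image_embs) if cosine(query_emb, e) >= threshold]

# Toy stand-ins for CLIP embeddings of a text query and three images.
query = [1.0, 0.0, 0.0]           # e.g. an "ultrasound" text embedding
images = [[0.95, 0.10, 0.00],     # ultrasound-like
          [0.00, 1.00, 0.00],     # unrelated
          [0.90, 0.05, 0.10]]     # ultrasound-like
```

`retrieve(query, images)` returns the two ultrasound-like entries; the same one-query-against-all scan, run over 400M embeddings with an approximate-nearest-neighbor index, is how such subsets are surfaced in practice.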
https://arxiv.org/abs/2602.07149
Generalizable image-based person re-identification (Re-ID) aims to recognize individuals across cameras in unseen domains without retraining. While multiple existing approaches address the domain gap through complex architectures, recent findings indicate that better generalization can be achieved by stylistically diverse single-camera data. Although this data is easy to collect, it lacks complexity due to minimal cross-view variation. We propose ReText, a novel method trained on a mixture of multi-camera Re-ID data and single-camera data, where the latter is complemented by textual descriptions to enrich semantic cues. During training, ReText jointly optimizes three tasks: (1) Re-ID on multi-camera data, (2) image-text matching, and (3) image reconstruction guided by text on single-camera data. Experiments demonstrate that ReText achieves strong generalization and significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks. To the best of our knowledge, this is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based person Re-ID.
https://arxiv.org/abs/2602.05785
Both fine-grained discriminative details and global semantic features can contribute to solving person re-identification challenges, such as occlusion and pose variations. Vision foundation models (\textit{e.g.}, DINO) excel at mining local textures, and vision-language models (\textit{e.g.}, CLIP) capture strong global semantic differences. Existing methods predominantly rely on a single paradigm, neglecting the potential benefits of their integration. In this paper, we analyze the complementary roles of these two architectures and propose a framework to synergize their strengths by a \textbf{D}ual-\textbf{R}egularized Bidirectional \textbf{Transformer} (\textbf{DRFormer}). The dual-regularization mechanism ensures diverse feature extraction and achieves a better balance in the contributions of the two models. Extensive experiments on five benchmarks show that our method effectively harmonizes local and global representations, achieving competitive performance against state-of-the-art methods.
https://arxiv.org/abs/2602.01059
Identifying molecules from mass spectrometry (MS) data remains a fundamental challenge due to the semantic gap between physical spectral peaks and underlying chemical structures. Existing deep learning approaches often treat spectral matching as a closed-set recognition task, limiting their ability to generalize to unseen molecular scaffolds. To overcome this limitation, we propose a cross-modal alignment framework that directly maps mass spectra into the chemically meaningful molecular structure embedding space of a pretrained chemical language model. On a strict scaffold-disjoint benchmark, our model achieves a Top-1 accuracy of 42.2% in fixed 256-way zero-shot retrieval and demonstrates strong generalization under a global retrieval setting. Moreover, the learned embedding space demonstrates strong chemical coherence, reaching 95.4% accuracy in 5-way 5-shot molecular re-identification. These results suggest that explicitly integrating physical spectral resolution with molecular structure embedding is key to solving the generalization bottleneck in molecular identification from MS data.
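The fixed 256-way zero-shot retrieval metric reduces to checking where the true molecule ranks among the candidates by embedding similarity. A minimal Top-k evaluator, with a tiny hand-made similarity matrix standing in for real spectrum-to-structure scores:

```python
def topk_accuracy(sim_rows, true_idx, k=1):
    """Fraction of queries whose true match ranks in the top-k by similarity."""
    hits = 0
    for sims, t in zip(sim_rows, true_idx):
        ranked = sorted(range(len(sims)), key=lambda j: -sims[j])
        if t in ranked[:k]:
            hits += 1
    return hits / len(sim_rows)

# Toy 3-query, 3-candidate similarity matrix (rows: queries, cols: candidates).
sim = [[0.9, 0.2, 0.1],
       [0.3, 0.1, 0.8],
       [0.5, 0.6, 0.4]]
truth = [0, 2, 0]   # the third query's true match is only ranked second
```

Here Top-1 accuracy is 2/3 while Top-2 is 1.0, the same shape of gap the paper reports between its Top-1 and Top-10 numbers.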
https://arxiv.org/abs/2602.00547
Aerial-ground person re-identification (AG-ReID) is fundamentally challenged by extreme viewpoint and distance discrepancies between aerial and ground cameras, which induce severe geometric distortions and invalidate the assumption of a shared similarity space across views. Existing methods primarily rely on geometry-aware feature learning or appearance-conditioned prompting, while implicitly assuming that the geometry-invariant dot-product similarity used in attention mechanisms remains reliable under large viewpoint and scale variations. We argue that this assumption does not hold. Extreme camera geometry systematically distorts the query-key similarity space and degrades attention-based matching, even when feature representations are partially aligned. To address this issue, we introduce Geometry-Induced Query-Key Transformation (GIQT), a lightweight low-rank module that explicitly rectifies the similarity space by conditioning query-key interactions on camera geometry. Rather than modifying feature representations or the attention formulation itself, GIQT adapts the similarity computation to compensate for dominant geometry-induced anisotropic distortions. Building on this local similarity rectification, we further incorporate a geometry-conditioned prompt generation mechanism that provides global, view-adaptive representation priors derived directly from camera geometry. Experiments on four aerial-ground person re-identification benchmarks demonstrate that the proposed framework consistently improves robustness under extreme and previously unseen geometric conditions, while introducing minimal computational overhead compared to state-of-the-art methods.
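The idea of rectifying the similarity space rather than the features can be sketched as a low-rank correction to the query-key dot product. The `I + U·Vᵀ` form and the toy 2-D example below are one reading of "lightweight low-rank module", not the paper's exact parameterization; in GIQT, U and V would be produced by conditioning on camera geometry:

```python
def rectified_similarity(q, k, U, V):
    """Dot-product similarity q·k plus a low-rank correction q·U·Vᵀ·k.
    U and V are d×r matrices (r << d), assumed here to come from some
    geometry-conditioning network."""
    base = sum(a * b for a, b in zip(q, k))
    r = len(U[0])
    qU = [sum(q[i] * U[i][j] for i in range(len(q))) for j in range(r)]
    kV = [sum(k[i] * V[i][j] for i in range(len(k))) for j in range(r)]
    return base + sum(a * b for a, b in zip(qU, kV))

# Toy d=2, r=1 case: the raw dot product misses the match (q ⟂ k after a
# viewpoint distortion), but a rank-1 geometry-conditioned correction recovers it.
q, k = [1.0, 0.0], [0.0, 1.0]
U, V = [[1.0], [0.0]], [[0.0], [1.0]]
```

Because only the 2·d·r parameters of U and V are added per attention layer, the correction stays cheap relative to modifying the feature extractor itself.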
https://arxiv.org/abs/2601.21405
Person Re-Identification (ReID) remains a challenging problem in computer vision. This work reviews various training paradigms, evaluates the robustness of state-of-the-art ReID models in cross-domain applications, and examines the role of foundation models in improving generalization through richer, more transferable visual representations. We compare three training paradigms: supervised, self-supervised, and language-aligned models. Through this study, we aim to answer the following questions: Can supervised models generalize in cross-domain scenarios? How do foundation models like SigLIP2 perform on ReID tasks? What are the weaknesses of current supervised and foundation models for ReID? We conducted the analysis across 11 models and 9 datasets. Our results show a clear split: supervised models dominate their training domain but crumble on cross-domain data. Language-aligned models, however, show surprising cross-domain robustness for ReID tasks, even though they are not explicitly trained to do so. Code and data available at: this https URL.
https://arxiv.org/abs/2601.20598
Unlike conventional person re-identification (ReID), clothes-changing ReID (CC-ReID) presents severe challenges due to substantial appearance variations introduced by clothing changes. In this work, we propose the Quality-Aware Dual-Branch Matching (QA-ReID), which jointly leverages RGB-based features and parsing-based representations to model both global appearance and clothing-invariant structural cues. These heterogeneous features are adaptively fused through a multi-modal attention module. At the matching stage, we further design the Quality-Aware Query Adaptive Convolution (QAConv-QA), which incorporates pixel-level importance weighting and bidirectional consistency constraints to enhance robustness against clothing variations. Extensive experiments demonstrate that QA-ReID achieves state-of-the-art performance on multiple benchmarks, including PRCC, LTCC, and VC-Clothes, and significantly outperforms existing approaches under cross-clothing scenarios.
https://arxiv.org/abs/2601.19133
In skeleton-based human activity understanding, existing methods often adopt the contrastive learning paradigm to construct a discriminative feature space. However, many of these approaches fail to exploit the structural inter-class similarities and overlook the impact of anomalous positive samples. In this study, we introduce ACLNet, an Affinity Contrastive Learning Network that explores the intricate clustering relationships among human activity classes to improve feature discrimination. Specifically, we propose an affinity metric to refine similarity measurements, thereby forming activity superclasses that provide more informative contrastive signals. A dynamic temperature schedule is also introduced to adaptively adjust the penalty strength for various superclasses. In addition, we employ a margin-based contrastive strategy to improve the separation of hard positive and negative samples within classes. Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, Kinetics-Skeleton, PKU-MMD, FineGYM, and CASIA-B demonstrate the superiority of our method in skeleton-based action recognition, gait recognition, and person re-identification. The source code is available at this https URL.
https://arxiv.org/abs/2601.16694
Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical applications of VLMs, e.g., where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, which not only incur substantial data collection and training costs but also struggle with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders to provide specialized features for learning diverse instances, which enables VLMs to learn new instances in-context in a one-shot manner. Further, IIR-VLM leverages this knowledge for instance-aware visual understanding. We validate IIR-VLM's efficacy on existing instance personalization benchmarks. Finally, we demonstrate its superior ILR performance on a challenging new benchmark, which assesses ILR capabilities across varying difficulty levels and diverse categories, with persons, faces, pets, and general objects as the target instances.
https://arxiv.org/abs/2601.14188
Psychiatric narratives encode patient identity not only through explicit identifiers but also through idiosyncratic life events embedded in their clinical structure. Existing de-identification approaches, including PHI masking and LLM-based synthetic rewriting, operate at the text level and offer limited control over which semantic elements are preserved or altered. We introduce Anonpsy, a de-identification framework that reformulates the task as graph-guided semantic rewriting. Anonpsy (1) converts each narrative into a semantic graph encoding clinical entities, temporal anchors, and typed relations; (2) applies graph-constrained perturbations that modify identifying context while preserving clinically essential structure; and (3) regenerates text via graph-conditioned LLM generation. Evaluated on 90 clinician-authored psychiatric case narratives, Anonpsy preserves diagnostic fidelity while achieving consistently low re-identification risk under expert, semantic, and GPT-5-based evaluations. Compared with a strong LLM-only rewriting baseline, Anonpsy yields substantially lower semantic similarity and identifiability. These results demonstrate that explicit structural representations combined with constrained generation provide an effective approach to de-identification for psychiatric narratives.
https://arxiv.org/abs/2601.13503
We introduce a method for decentralized person re-identification in robot swarms that leverages natural language as the primary representational modality. Unlike traditional approaches that rely on opaque visual embeddings -- high-dimensional feature vectors extracted from images -- the proposed method uses human-readable language to represent observations. Each robot locally detects and describes individuals using a vision-language model (VLM), producing textual descriptions of appearance instead of feature vectors. These descriptions are compared and clustered across the swarm without centralized coordination, allowing robots to collaboratively group observations of the same individual. Each cluster is distilled into a representative description by a language model, providing an interpretable, concise summary of the swarm's collective perception. This approach enables natural-language querying, enhances transparency, and supports explainable swarm behavior. Preliminary experiments demonstrate competitive performance in identity consistency and interpretability compared to embedding-based methods, despite current limitations in text similarity and computational load. Ongoing work explores refined similarity metrics, semantic navigation, and the extension of language-based perception to environmental elements. This work prioritizes decentralized perception and communication, while active navigation remains an open direction for future study.
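Comparing and clustering textual descriptions without centralized coordination can be approximated with a simple token-overlap measure. The Jaccard similarity and greedy clustering below are an illustrative stand-in; the paper's actual text-similarity metric is one of the limitations it says it is still refining:

```python
def jaccard(a, b):
    """Token-overlap similarity between two textual descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def cluster_descriptions(descs, threshold=0.5):
    """Greedy clustering: join a description to the first cluster whose
    representative it resembles, otherwise start a new cluster."""
    clusters = []
    for d in descs:
        for c in clusters:
            if jaccard(d, c[0]) >= threshold:
                c.append(d)
                break
        else:
            clusters.append([d])
    return clusters

descs = ["red jacket blue jeans",
         "red jacket dark jeans",    # same individual, different wording
         "green coat black pants"]   # a different individual
clusters = cluster_descriptions(descs)
```

The two "red jacket" sightings collapse into one cluster while the third stays separate; each cluster would then be distilled into a representative description by a language model.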
https://arxiv.org/abs/2601.12479
The core of video-based visible-infrared person re-identification (VVI-ReID) lies in learning sequence-level modal-invariant representations across different modalities. Recent research tends to use modality-shared language prompts generated by CLIP to guide the learning of modal-invariant representations. Despite achieving optimal performance, such methods still face limitations in efficient spatial-temporal modeling, sufficient cross-modal interaction, and explicit modality-level loss guidance. To address these issues, we propose the language-driven sequence-level modal-invariant representation learning (LSMRL) method, which includes a spatial-temporal feature learning (STFL) module, a semantic diffusion (SD) module, and a cross-modal interaction (CMI) module. To enable parameter- and computation-efficient spatial-temporal modeling, the STFL module is built upon CLIP with minimal modifications. To achieve sufficient cross-modal interaction and enhance the learning of modal-invariant features, the SD module is proposed to diffuse modality-shared language prompts into visible and infrared features to establish preliminary modal consistency. The CMI module is further developed to leverage bidirectional cross-modal self-attention to eliminate residual modality gaps and refine modal-invariant representations. To explicitly enhance the learning of modal-invariant representations, two modality-level losses are introduced to improve the features' discriminative ability and their generalization to unseen categories. Extensive experiments on large-scale VVI-ReID datasets demonstrate the superiority of LSMRL over state-of-the-art methods.
https://arxiv.org/abs/2601.12062
We propose unsupervised multi-scenario (UMS) person re-identification (ReID) as a new task that expands ReID across diverse scenarios (cross-resolution, clothing change, etc.) within a single coherent framework. To tackle UMS-ReID, we introduce image-text knowledge modeling (ITKM) -- a three-stage framework that effectively exploits the representational power of vision-language models. We start with a pre-trained CLIP model with an image encoder and a text encoder. In Stage I, we introduce a scenario embedding in the image encoder and fine-tune the encoder to adaptively leverage knowledge from multiple scenarios. In Stage II, we optimize a set of learned text embeddings to associate with pseudo-labels from Stage I and introduce a multi-scenario separation loss to increase the divergence between inter-scenario text representations. In Stage III, we first introduce cluster-level and instance-level heterogeneous matching modules to obtain reliable heterogeneous positive pairs (e.g., a visible image and an infrared image of the same person) within each scenario. Next, we propose a dynamic text representation update strategy to maintain consistency between text and image supervision signals. Experimental results across multiple scenarios demonstrate the superiority and generalizability of ITKM; it not only outperforms existing scenario-specific methods but also enhances overall performance by integrating knowledge from multiple scenarios.
https://arxiv.org/abs/2601.11243
Text-to-image person re-identification (TIReID) aims to retrieve person images from a large gallery given free-form textual descriptions. TIReID is challenging due to the substantial modality gap between visual appearances and textual expressions, as well as the need to model fine-grained correspondences that distinguish individuals with similar attributes such as clothing color, texture, or outfit style. To address these issues, we propose DiCo (Disentangled Concept Representation), a novel framework that achieves hierarchical and disentangled cross-modal alignment. DiCo introduces a shared slot-based representation, where each slot acts as a part-level anchor across modalities and is further decomposed into multiple concept blocks. This design enables the disentanglement of complementary attributes (\textit{e.g.}, color, texture, shape) while maintaining consistent part-level correspondence between image and text. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that our framework achieves competitive performance with state-of-the-art methods, while also enhancing interpretability through explicit slot- and block-level representations for more fine-grained retrieval results.
https://arxiv.org/abs/2601.10053
We explore a situation in which the target domain is accessible, but real-time data annotation is not feasible. Instead, we would like to construct an alternative training set from a large-scale data server so that a competitive model can be obtained. This problem is difficult because the target domain usually exhibits distinct modes (i.e., semantic clusters representing the data distribution); if the training set does not contain these target modes, model performance is compromised. While prior works improve algorithms iteratively, our research explores the often-overlooked potential of optimizing the structure of the data server. Inspired by the hierarchical nature of web search engines, we introduce a hierarchical data server, together with a bipartite mode matching algorithm (BMM) to align source and target modes. For each target mode, we look in the server data tree for the best mode match, which might be large or small in size. Through bipartite matching, we aim for all target modes to be optimally matched with source modes in a one-to-one fashion. Compared with existing training set search algorithms, we show that the matched server modes constitute training sets that have consistently smaller domain gaps with the target domain across object re-identification (re-ID) and detection tasks. Consequently, models trained on our searched training sets achieve higher accuracy than those trained otherwise. BMM enables data-centric unsupervised domain adaptation (UDA) orthogonal to existing model-centric UDA methods. By combining BMM with existing UDA methods such as pseudo-labeling, further improvement is observed.
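The bipartite matching objective reduces to a minimum-cost one-to-one assignment between target and source modes. The brute-force solver and the toy gap matrix below are illustrative; BMM itself searches a hierarchical server tree, and at scale one would use a proper assignment algorithm (e.g. Hungarian) instead of enumerating permutations:

```python
from itertools import permutations

def bipartite_mode_match(gap):
    """gap[t][s] = domain gap between target mode t and source mode s.
    Returns the one-to-one assignment (source index per target mode)
    that minimizes the total gap; brute force is fine for a few modes."""
    n_t, n_s = len(gap), len(gap[0])
    best, best_cost = None, float("inf")
    for perm in permutations(range(n_s), n_t):
        cost = sum(gap[t][s] for t, s in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best), best_cost

# Two target modes, two candidate source modes on the server.
assignment, total_gap = bipartite_mode_match([[0.1, 0.9],
                                              [0.8, 0.2]])
```

Here each target mode gets its low-gap source mode (total gap 0.3), whereas a greedy per-mode pick could be forced into the 1.7 pairing if source modes were scarce.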
https://arxiv.org/abs/2601.09531
Stylometry--the identification of an author through analysis of a text's style (i.e., authorship attribution)--serves many constructive purposes: it supports copyright and plagiarism investigations, aids detection of harmful content, offers exploratory cues for certain medical conditions (e.g., early signs of dementia or depression), provides historical context for literary works, and helps uncover misinformation and disinformation. In contrast, when stylometry is employed as a tool for authorship verification--confirming whether a text truly originates from a claimed author--it can also be weaponized for malicious purposes. Techniques such as de-anonymization, re-identification, tracking, profiling, and downstream effects like censorship illustrate the privacy threats that stylometric analysis can enable. Building on these concerns, this paper further explores how adversarial stylometry combined with steganography can counteract stylometric analysis. We first present enhancements to our adversarial attack, $\textit{TraceTarnish}$, providing stronger evidence of its capacity to confound stylometric systems and reduce their attribution and verification accuracy. Next, we examine how steganographic embedding can be fine-tuned to mask an author's stylistic fingerprint, quantifying the level of authorship obfuscation achievable as a function of the proportion of words altered with zero-width Unicode characters. Based on our findings, steganographic coverage of 33% or higher seemingly ensures authorship obfuscation. Finally, we reflect on the ways stylometry can be used to undermine privacy and argue for the necessity of defensive tools like $\textit{TraceTarnish}$.
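The coverage experiment can be sketched as follows. The mid-word insertion point, the specific zero-width code point, and the regular spacing of altered words are illustrative choices, not the paper's exact embedding scheme:

```python
ZWSP = "\u200b"  # zero-width space: invisible when rendered, present in the bytes

def obfuscate(text, coverage=0.33):
    """Insert a zero-width character inside roughly `coverage` of the words."""
    words = text.split(" ")
    step = max(1, round(1 / coverage))
    out = []
    for i, w in enumerate(words):
        if i % step == 0 and len(w) > 1:
            w = w[: len(w) // 2] + ZWSP + w[len(w) // 2 :]
        out.append(w)
    return " ".join(out)

def coverage_of(text):
    """Proportion of words carrying a zero-width marker."""
    words = text.split(" ")
    return sum(ZWSP in w for w in words) / len(words)

plain = "the quick brown fox jumps over"
hidden = obfuscate(plain)   # renders identically to `plain`
```

The altered text is byte-wise different yet visually unchanged, which is what disrupts character- and token-level stylometric features; the paper's finding is that pushing such coverage to 33% or higher seemingly suffices for authorship obfuscation.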
https://arxiv.org/abs/2601.09056
Tracklet quality is often treated as an afterthought in most person re-identification (ReID) methods, with the majority of research presenting architectural modifications to foundational models. Such approaches neglect an important limitation, posing challenges when deploying ReID systems in real-world, difficult scenarios. In this paper, we introduce S3-CLIP, a video super-resolution-based CLIP-ReID framework developed for the VReID-XFD challenge at WACV 2026. The proposed method integrates recent advances in super-resolution networks with task-driven super-resolution pipelines, adapting them to the video-based person re-identification setting. To the best of our knowledge, this work represents the first systematic investigation of video super-resolution as a means of enhancing tracklet quality for person ReID, particularly under challenging cross-view conditions. Experimental results demonstrate performance competitive with the baseline, achieving 37.52% mAP in aerial-to-ground and 29.16% mAP in ground-to-aerial scenarios. In the ground-to-aerial setting, S3-CLIP achieves substantial gains in ranking accuracy, improving Rank-1, Rank-5, and Rank-10 performance by 11.24%, 13.48%, and 17.98%, respectively.
https://arxiv.org/abs/2601.08807
Accurate individual identification is essential for monitoring rare amphibians, yet invasive marking is often unsuitable for critically endangered species. We evaluate state-of-the-art computer-vision methods for photographic re-identification of the Hula painted frog (Latonia nigriventer) using 1,233 ventral images from 191 individuals collected during 2013-2020 capture-recapture surveys. We compare deep local-feature matching in a zero-shot setting with deep global-feature embedding models. The local-feature pipeline achieves 98% top-1 closed-set identification accuracy, outperforming all global-feature models; fine-tuning improves the best global-feature model to 60% top-1 (91% top-10) but remains below local matching. To combine scalability with accuracy, we implement a two-stage workflow in which a fine-tuned global-feature model retrieves a short candidate list that is re-ranked by local-feature matching, reducing end-to-end runtime from 6.5-7.8 hours to ~38 minutes while maintaining ~96% top-1 closed-set accuracy on the labeled dataset. Separation of match scores between same- and different-individual pairs supports thresholding for open-set identification, enabling practical handling of novel individuals. We deploy this pipeline as a web application for routine field use, providing rapid, standardized, non-invasive identification to support conservation monitoring and capture-recapture analyses. Overall, in this species, zero-shot deep local-feature matching outperformed global-feature embedding and provides a strong default for photo-identification.
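The two-stage workflow is cheap global retrieval followed by expensive local re-ranking on a shortlist. In this sketch, the cosine global score, the keypoint-overlap local matcher, and the toy gallery are stand-ins for the fine-tuned embedding model and the deep local-feature matcher:

```python
import math

def cosine(u, v):
    """Fast global-embedding similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def two_stage_identify(query, gallery, local_match, shortlist_size=2):
    """Stage 1: rank the whole gallery by the cheap global score.
    Stage 2: re-rank only the shortlist with the expensive local matcher."""
    ranked = sorted(range(len(gallery)),
                    key=lambda i: -cosine(query["global"], gallery[i]["global"]))
    shortlist = ranked[:shortlist_size]
    return max(shortlist, key=lambda i: local_match(query["local"], gallery[i]["local"]))

def overlap(a, b):
    """Stand-in local matcher: count of shared 'keypoints'."""
    return len(a & b)

# Toy gallery: the global score alone would pick frog 0, but frog 1 is
# the true individual and wins after local re-ranking of the shortlist.
query = {"global": [1.0, 0.1], "local": {"a", "b", "c"}}
gallery = [
    {"global": [1.0, 0.0], "local": {"x", "y"}},           # global near-miss
    {"global": [0.9, 0.2], "local": {"a", "b", "c", "d"}}, # true individual
    {"global": [0.0, 1.0], "local": {"a"}},                # filtered at stage 1
]
```

Because the expensive matcher only runs on the shortlist, runtime scales with the shortlist size rather than the gallery size, mirroring the paper's drop from 6.5-7.8 hours to roughly 38 minutes.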
https://arxiv.org/abs/2601.08798