In real-world scenarios, person re-identification (ReID) is expected to identify a person-of-interest from a descriptive query, regardless of whether the query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to limited modalities and fail to meet this requirement. We therefore investigate a new and challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address dataset scarcity, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. The dataset also offers substantial diversity, for example in painting perspectives and textual information, and can serve as an ideal platform for follow-up investigations into OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID. It enables synergistic fusion and cross-modal alignment of arbitrary modality combinations within a single model, through a proposed unified encoding and multi-expert routing mechanism. Extensive experiments verify the advancement and practicality of our ORBench: a wide range of possible models have been evaluated and compared on it, and our proposed ReID5o model gives the best performance. The dataset and code will be made publicly available at this https URL.
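Since an OM-ReID query may combine any subset of the five modalities, the sketch below illustrates one way such arbitrary combinations could be routed through per-modality experts and fused into a single retrieval embedding. The module layout, dimensions, and the simple average fusion are illustrative assumptions, not the actual ReID5o design.

```python
# A minimal sketch of retrieval with arbitrary modality combinations (not ReID5o itself).
import torch
import torch.nn as nn
import torch.nn.functional as F

MODALITIES = ["rgb", "infrared", "color_pencil", "sketch", "text"]

class MultiModalQueryEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # one lightweight expert head per modality on top of a shared embedding space
        self.experts = nn.ModuleDict({m: nn.Linear(dim, dim) for m in MODALITIES})

    def forward(self, feats: dict) -> torch.Tensor:
        # feats maps modality name -> pre-extracted feature of shape (B, dim);
        # any subset of the modalities may be present in a query
        routed = [self.experts[m](x) for m, x in feats.items()]
        fused = torch.stack(routed, dim=0).mean(dim=0)   # simple average fusion
        return F.normalize(fused, dim=-1)                # unit-norm identity embedding

encoder = MultiModalQueryEncoder()
query = {"sketch": torch.randn(4, 512), "text": torch.randn(4, 512)}
gallery = F.normalize(torch.randn(1000, 512), dim=-1)
scores = encoder(query) @ gallery.T                      # cosine retrieval scores
print(scores.shape)  # torch.Size([4, 1000])
```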
https://arxiv.org/abs/2506.09385
Dense image correspondence is central to many applications, such as visual odometry, 3D reconstruction, object association, and re-identification. Historically, dense correspondence has been tackled separately for wide-baseline scenarios and optical flow estimation, despite the common goal of matching content between two images. In this paper, we develop a Unified Flow & Matching model (UFM), which is trained on unified data for pixels that are co-visible in both source and target images. UFM uses a simple, generic transformer architecture that directly regresses the (u,v) flow. It is easier to train and more accurate for large flows compared to the typical coarse-to-fine cost volumes in prior work. UFM is 28% more accurate than state-of-the-art flow methods (Unimatch), while also achieving 62% less error and running 6.7x faster than dense wide-baseline matchers (RoMa). UFM is the first to demonstrate that unified training can outperform specialized approaches across both domains. This result enables fast, general-purpose correspondence and opens new directions for multi-modal, long-range, and real-time correspondence tasks.
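To make the direct-regression idea concrete, here is a toy sketch of a generic transformer that jointly attends over patch tokens of both images and regresses a dense (u, v) flow plus a covisibility map for the source image. The layer sizes, per-patch decoder, and covisibility head are assumptions for illustration, not the UFM architecture.

```python
# Toy direct (u, v) flow regressor with a generic transformer (not the UFM model).
import torch
import torch.nn as nn

class TinyFlowRegressor(nn.Module):
    def __init__(self, patch=16, dim=256, depth=4):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # per-patch regression: (u, v) flow plus a covisibility logit for every pixel
        self.head = nn.Linear(dim, patch * patch * 3)

    def forward(self, src, tgt):
        B, _, H, W = src.shape
        s = self.embed(src).flatten(2).transpose(1, 2)   # (B, N, dim) source tokens
        t = self.embed(tgt).flatten(2).transpose(1, 2)   # (B, N, dim) target tokens
        x = self.encoder(torch.cat([s, t], dim=1))       # joint attention over both images
        out = self.head(x[:, : s.shape[1]])              # decode only the source tokens
        out = out.view(B, H // self.patch, W // self.patch, self.patch, self.patch, 3)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, 3, H, W)
        return out[:, :2], out[:, 2:]                    # dense flow, covisibility logits

flow, covis = TinyFlowRegressor()(torch.randn(1, 3, 128, 160), torch.randn(1, 3, 128, 160))
print(flow.shape, covis.shape)  # torch.Size([1, 2, 128, 160]) torch.Size([1, 1, 128, 160])
```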
https://arxiv.org/abs/2506.09278
Vision Transformers (ViTs) have demonstrated impressive performance across a wide range of biometric tasks, including face and body recognition. In this work, we adapt a ViT model pretrained on visible (VIS) imagery to the challenging problem of cross-spectral body recognition, which involves matching images captured in the visible and infrared (IR) domains. Recent ViT architectures have explored incorporating additional embeddings beyond traditional positional embeddings. Building on this idea, we integrate Side Information Embedding (SIE) and examine the impact of encoding domain and camera information to enhance cross-spectral matching. Surprisingly, our results show that encoding only camera information - without explicitly incorporating domain information - achieves state-of-the-art performance on the LLCM dataset. While occlusion handling has been extensively studied in visible-spectrum person re-identification (Re-ID), occlusions in visible-infrared (VI) Re-ID remain largely underexplored - primarily because existing VI-ReID datasets, such as LLCM, SYSU-MM01, and RegDB, predominantly feature full-body, unoccluded images. To address this gap, we analyze the impact of range-induced occlusions using the IARPA Janus Benchmark Multi-Domain Face (IJB-MDF) dataset, which provides a diverse set of visible and infrared images captured at various distances, enabling cross-range, cross-spectral evaluations.
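As a rough illustration of camera-only side information, the sketch below adds a learnable per-camera embedding to the ViT token sequence before encoding. The scaling factor and the point of integration are assumptions rather than the exact configuration studied in the paper.

```python
# Minimal camera-only Side Information Embedding (SIE) sketch (illustrative only).
import torch
import torch.nn as nn

class CameraSIE(nn.Module):
    def __init__(self, num_cameras: int, dim: int, scale: float = 1.0):
        super().__init__()
        self.cam_embed = nn.Parameter(torch.zeros(num_cameras, dim))
        nn.init.trunc_normal_(self.cam_embed, std=0.02)
        self.scale = scale

    def forward(self, tokens: torch.Tensor, cam_ids: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch (+ [CLS]) embeddings; cam_ids: (B,) camera indices
        return tokens + self.scale * self.cam_embed[cam_ids].unsqueeze(1)

sie = CameraSIE(num_cameras=6, dim=768)
tokens = torch.randn(2, 197, 768)            # e.g. a ViT-B/16 token sequence
print(sie(tokens, torch.tensor([0, 3])).shape)  # torch.Size([2, 197, 768])
```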
https://arxiv.org/abs/2506.08953
Collaborative perception plays a crucial role in enhancing environmental understanding by expanding the perceptual range and improving robustness against sensor failures, which primarily involves collaborative 3D detection and tracking tasks. The former focuses on object recognition in individual frames, while the latter captures continuous instance tracklets over time. However, existing works in both areas predominantly focus on the vehicle superclass, lacking effective solutions for both multi-class collaborative detection and tracking. This limitation hinders their applicability in real-world scenarios, which involve diverse object classes with varying appearances and motion patterns. To overcome these limitations, we propose a multi-class collaborative detection and tracking framework tailored for diverse road users. We first present a detector with a global spatial attention fusion (GSAF) module, enhancing multi-scale feature learning for objects of varying sizes. Next, we introduce a tracklet RE-IDentification (REID) module that leverages visual semantics from a vision foundation model to effectively reduce ID SWitch (IDSW) errors, particularly in cases of erroneous mismatches involving small objects such as pedestrians. We further design a velocity-based adaptive tracklet management (VATM) module that dynamically adjusts the tracking interval based on object motion. Extensive experiments on the V2X-Real and OPV2V datasets show that our approach significantly outperforms existing state-of-the-art methods in both detection and tracking accuracy.
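One plausible reading of velocity-based adaptive tracklet management is sketched below: an unmatched tracklet is retained for fewer frames when its object moves fast and for more frames when it is nearly static. The thresholds and the linear interpolation are purely illustrative assumptions, not values from the paper.

```python
# Hedged sketch of velocity-adaptive tracklet retention (illustrative assumptions only).
def adaptive_max_age(speed_mps: float, base_age: int = 30,
                     slow: float = 0.5, fast: float = 5.0) -> int:
    """Return how many frames to retain an unmatched tracklet before termination."""
    if speed_mps <= slow:        # near-static objects: occlusions are benign, keep longer
        return base_age * 2
    if speed_mps >= fast:        # fast objects leave the scene quickly: drop sooner
        return max(base_age // 3, 1)
    # linear interpolation in between
    frac = (speed_mps - slow) / (fast - slow)
    return int(base_age * 2 - frac * (base_age * 2 - base_age // 3))

print(adaptive_max_age(0.2), adaptive_max_age(2.0), adaptive_max_age(8.0))
```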
https://arxiv.org/abs/2506.07375
We introduce CzechLynx, the first large-scale, open-access dataset for individual identification, 2D pose estimation, and instance segmentation of the Eurasian lynx (Lynx lynx). CzechLynx includes more than 30k camera trap images annotated with segmentation masks, identity labels, and 20-point skeletons and covers 219 unique individuals across 15 years of systematic monitoring in two geographically distinct regions: Southwest Bohemia and the Western Carpathians. To increase the data variability, we create a complementary synthetic set with more than 100k photorealistic images generated via a Unity-based pipeline and diffusion-driven text-to-texture modeling, covering diverse environments, poses, and coat-pattern variations. To allow testing generalization across spatial and temporal domains, we define three tailored evaluation protocols/splits: (i) geo-aware, (ii) time-aware open-set, and (iii) time-aware closed-set. We expect this dataset to be instrumental in benchmarking state-of-the-art models and in developing novel methods for individual animal re-identification and beyond.
https://arxiv.org/abs/2506.04931
Person Re-Identification (Re-ID) is a very important task in video surveillance systems, for example for tracking people, finding people in public places, or analysing customer behavior in supermarkets. Although many works have addressed this problem, challenges remain, such as large-scale datasets, imbalanced data, viewpoint variation, and fine-grained data (attributes); moreover, local features are not employed at the semantic level in the online stage of Re-ID, and the imbalanced-data problem of attributes is not taken into consideration. This paper proposes a Unified Re-ID system consisting of three main modules: a Pedestrian Attribute Ontology (PAO), a Local Multi-task DCNN (Local MDCNN), and an Imbalance Data Solver (IDS). The main novelty of our Re-ID system is the mutual support of the PAO, Local MDCNN, and IDS, which exploits the inner-group correlations of attributes and pre-filters mismatched candidates from the gallery set based on semantic information such as fashion attributes and facial attributes, thereby addressing the imbalanced attribute data without adjusting the network architecture or applying data augmentation. We experimented on the well-known Market1501 dataset. The experimental results show the effectiveness of our Re-ID system, which achieves higher performance on Market1501 than several state-of-the-art Re-ID methods.
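The semantic pre-filtering idea can be illustrated with a small sketch in which gallery candidates whose predicted attributes contradict the query's high-confidence attributes are discarded before distance ranking. The attribute names, confidence threshold, and hard-rejection rule are illustrative assumptions rather than the exact PAO/IDS logic.

```python
# Illustrative attribute-based gallery pre-filtering (not the paper's exact procedure).
def prefilter_gallery(query_attrs: dict, gallery_attrs: list, conf_thresh: float = 0.9):
    """Keep gallery indices whose attributes do not contradict confident query attributes."""
    keep = []
    for idx, cand in enumerate(gallery_attrs):
        ok = True
        for name, (value, conf) in query_attrs.items():
            if conf >= conf_thresh and name in cand and cand[name] != value:
                ok = False   # confident attribute mismatch -> drop candidate
                break
        if ok:
            keep.append(idx)
    return keep

query = {"gender": ("female", 0.97), "backpack": ("yes", 0.6)}
gallery = [{"gender": "male", "backpack": "yes"},
           {"gender": "female", "backpack": "no"},
           {"gender": "female", "backpack": "yes"}]
print(prefilter_gallery(query, gallery))  # [1, 2] -- only confident attributes filter
```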
https://arxiv.org/abs/2506.04143
Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both infrared and visible modalities. Leveraging the Contrastive Language-Image Pre-training (CLIP) model to generate video-level language prompts and guide the learning of modality-invariant sequence-level features is theoretically feasible. However, the challenge of generating and utilizing modality-shared video-level language prompts to address modality gaps remains a critical problem. To address this problem, we propose a simple yet powerful framework, video-level language-driven VVI-ReID (VLD), which consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal prompting (STP). IMLP employs a joint fine-tuning strategy for the visual encoder and the prompt learner to effectively generate modality-shared text prompts and align them with visual features from different modalities in CLIP's multimodal space, thereby mitigating modality differences. Additionally, STP models spatiotemporal information through two submodules, the spatial-temporal hub (STH) and spatial-temporal aggregation (STA), which further enhance IMLP by incorporating spatiotemporal information into text prompts. The STH aggregates and diffuses spatiotemporal information into the [CLS] token of each frame across the vision transformer (ViT) layers, whereas STA introduces dedicated identity-level loss and specialized multihead attention to ensure that the STH focuses on identity-relevant spatiotemporal feature aggregation. The VLD framework achieves state-of-the-art results on two VVI-ReID benchmarks. The code will be released at this https URL.
https://arxiv.org/abs/2506.02439
The increasing popularity of egocentric cameras has generated growing interest in studying multi-camera interactions in shared environments. Although large-scale datasets such as Ego4D and Ego-Exo4D have propelled egocentric vision research, interactions between multiple camera wearers remain underexplored, a key gap for applications like immersive learning and collaborative robotics. To bridge this gap, we present TF2025, an expanded dataset with synchronized first- and third-person views. In addition, we introduce a sequence-based method to identify first-person wearers in third-person footage, combining motion cues and person re-identification.
https://arxiv.org/abs/2506.00394
In this paper, we leverage the advantages of event cameras to resist harsh lighting conditions, reduce background interference, achieve high time resolution, and protect facial information to study the long-sequence event-based person re-identification (Re-ID) task. To this end, we propose a simple and efficient long-sequence event Re-ID model, namely the Spike-guided Spatiotemporal Semantic Coupling and Expansion Network (S3CE-Net). To better handle asynchronous event data, we build S3CE-Net based on spiking neural networks (SNNs). The S3CE-Net incorporates the Spike-guided Spatial-temporal Attention Mechanism (SSAM) and the Spatiotemporal Feature Sampling Strategy (STFS). The SSAM is designed to carry out semantic interaction and association in both spatial and temporal dimensions, leveraging the capabilities of SNNs. The STFS involves sampling spatial feature subsequences and temporal feature subsequences from the spatiotemporal dimensions, driving the Re-ID model to perceive broader and more robust effective semantics. Notably, the STFS introduces no additional parameters and is only utilized during the training stage. Therefore, S3CE-Net is a low-parameter and high-efficiency model for long-sequence event-based person Re-ID. Extensive experiments have verified that our S3CE-Net achieves outstanding performance on many mainstream long-sequence event-based person Re-ID datasets. Code is available at: this https URL.
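A hedged sketch of the training-only sub-sampling idea follows: random temporal and spatial subsets of the backbone's spatiotemporal features are pooled into extra views that can share the identity supervision, without adding parameters. The pooling scheme and sampling ratios are assumptions for illustration, not the STFS design.

```python
# Illustrative training-time spatiotemporal feature sub-sampling (assumed ratios/pooling).
import torch

def sample_subsequences(feats: torch.Tensor, t_ratio: float = 0.5, p_ratio: float = 0.5):
    """feats: (B, T, P, D) per-frame, per-location features from the backbone."""
    B, T, P, D = feats.shape
    t_idx = torch.randperm(T)[: max(1, int(T * t_ratio))]   # random frame subset
    p_idx = torch.randperm(P)[: max(1, int(P * p_ratio))]   # random spatial subset
    temporal_view = feats[:, t_idx].mean(dim=(1, 2))         # pooled over sampled frames
    spatial_view = feats[:, :, p_idx].mean(dim=(1, 2))       # pooled over sampled locations
    return temporal_view, spatial_view                       # each (B, D), no new parameters

tv, sv = sample_subsequences(torch.randn(8, 16, 49, 256))
print(tv.shape, sv.shape)  # torch.Size([8, 256]) torch.Size([8, 256])
```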
https://arxiv.org/abs/2505.24401
Video-based person re-identification (Re-ID) remains brittle in real-world deployments despite impressive benchmark performance. Most existing models rely on superficial correlations such as clothing, background, or lighting that fail to generalize across domains, viewpoints, and temporal variations. This survey examines the emerging role of causal reasoning as a principled alternative to traditional correlation-based approaches in video-based Re-ID. We provide a structured and critical analysis of methods that leverage structural causal models, interventions, and counterfactual reasoning to isolate identity-specific features from confounding factors. The survey is organized around a novel taxonomy of causal Re-ID methods that spans generative disentanglement, domain-invariant modeling, and causal transformers. We review current evaluation metrics and introduce causal-specific robustness measures. In addition, we assess practical challenges of scalability, fairness, interpretability, and privacy that must be addressed for real-world adoption. Finally, we identify open problems and outline future research directions that integrate causal modeling with efficient architectures and self-supervised learning. This survey aims to establish a coherent foundation for causal video-based person Re-ID and to catalyze the next phase of research in this rapidly evolving domain.
https://arxiv.org/abs/2505.20540
Multi-modal object re-identification (ReID) aims to extract identity features across heterogeneous spectral modalities to enable accurate recognition and retrieval in complex real-world scenarios. However, most existing methods rely on implicit feature fusion structures, making it difficult to model fine-grained recognition strategies under varying challenging conditions. Benefiting from the powerful semantic understanding capabilities of Multi-modal Large Language Models (MLLMs), the visual appearance of an object can be effectively translated into descriptive text. In this paper, we propose a reliable multi-modal caption generation method based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs in multi-modal semantic generation and improves the quality of generated text. Additionally, we propose a novel ReID framework NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural expert branches to separately capture modality-specific appearance and intrinsic structure. For semantic recognition, we propose the Text-Modulated Semantic-sampling Experts (TMSE), which leverage randomly sampled high-quality semantic texts to modulate expert-specific sampling of multi-modal features and mine intra-modality fine-grained semantic cues. Then, to recognize coarse-grained structure features, we propose the Context-Shared Structure-aware Experts (CSSE), which focus on capturing the holistic object structure across modalities and maintain inter-modality structural consistency through a soft routing mechanism. Finally, we propose the Multi-Modal Feature Aggregation (MMFA), which adopts a unified feature fusion strategy to simply and effectively integrate semantic and structural expert outputs into the final identity representations.
https://arxiv.org/abs/2505.20001
Multi-view multi-object tracking (MVMOT) has found widespread applications in intelligent transportation, surveillance systems, and urban management. However, existing studies rarely address genuinely free-viewpoint MVMOT systems, which could significantly enhance the flexibility and scalability of cooperative tracking systems. To bridge this gap, we first construct the Multi-Drone Multi-Object Tracking (MDMOT) dataset, captured by mobile drone swarms across diverse real-world scenarios, establishing the first benchmark for multi-object tracking in arbitrary multi-view environments. Building upon this foundation, we propose FusionTrack, an end-to-end framework that integrates tracking and re-identification to leverage multi-view information for robust trajectory association. Extensive experiments on our MDMOT and other benchmark datasets demonstrate that FusionTrack achieves state-of-the-art performance in both single-view and multi-view tracking.
https://arxiv.org/abs/2505.18727
Person re-identification (ReID) models are known to suffer from camera bias, where learned representations cluster according to camera viewpoints rather than identity, leading to significant performance degradation under (inter-camera) domain shifts in real-world surveillance systems when new cameras are added to camera networks. State-of-the-art test-time adaptation (TTA) methods, largely designed for classification tasks, rely on classification entropy-based objectives that fail to generalize well to ReID, making them unsuitable for tackling camera bias. In this paper, we introduce DART$^3$, a TTA framework specifically designed to mitigate camera-induced domain shifts in person ReID. DART$^3$ (Distance-Aware Retrieval Tuning at Test Time) leverages a distance-based objective that aligns better with image retrieval tasks like ReID by exploiting the correlation between nearest-neighbor distance and prediction error. Unlike prior ReID-specific domain adaptation methods, DART$^3$ requires no source data, architectural modifications, or retraining, and can be deployed in both fully black-box and hybrid settings. Empirical evaluations on multiple ReID benchmarks indicate that DART$^3$ and DART$^3$ LITE, a lightweight alternative to the approach, consistently outperform state-of-the-art TTA baselines, making them a viable option for online learning to mitigate the adverse effects of camera bias.
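To illustrate a distance-based test-time objective of this kind, the sketch below minimizes each query's nearest-neighbor cosine distance to the gallery while adapting a lightweight projection head. The exact loss form, the choice of k, and adapting only a projection head are assumptions for illustration, not the DART$^3$ formulation.

```python
# Illustrative distance-aware test-time adaptation step (assumed loss and setup).
import torch
import torch.nn.functional as F

def nn_distance_loss(query_feats: torch.Tensor, gallery_feats: torch.Tensor, k: int = 1):
    """query_feats: (B, D) adapted features; gallery_feats: (N, D) fixed gallery features."""
    q = F.normalize(query_feats, dim=-1)
    g = F.normalize(gallery_feats, dim=-1)
    dists = 1.0 - q @ g.T                                 # cosine distances, (B, N)
    knn = dists.topk(k, dim=1, largest=False).values      # k nearest gallery neighbors
    return knn.mean()                                     # lower => queries sit closer to gallery

# one adaptation step on a lightweight projection head (illustrative choice)
head = torch.nn.Linear(512, 512)
opt = torch.optim.SGD(head.parameters(), lr=1e-3)
queries, gallery = torch.randn(32, 512), torch.randn(1000, 512)
loss = nn_distance_loss(head(queries), gallery)
loss.backward()
opt.step()
print(float(loss))
```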
https://arxiv.org/abs/2505.18337
Multi-spectral object re-identification (ReID) brings a new perception perspective for smart city and intelligent transportation applications, effectively addressing challenges from complex illumination and adverse weather. However, complex modal differences between heterogeneous spectra pose challenges to efficiently utilizing the complementary and discrepant information across spectra. Most existing methods fuse spectral data through intricate modal interaction modules and lack fine-grained semantic understanding of spectral information (e.g., text descriptions, part masks, and object keypoints). To address this challenge, we propose a novel Identity-Conditional text Prompt Learning framework (ICPL), which exploits the powerful cross-modal alignment capability of CLIP to unify visual features from different spectra through text semantics. Specifically, we first propose online prompt learning, which uses a learnable text prompt as the identity-level semantic center to bridge the identity semantics of different spectra in an online manner. Then, in the absence of concrete text descriptions, we propose a multi-spectral identity-condition module that uses the identity prototype as a spectral identity condition to constrain prompt learning. Meanwhile, we construct an alignment loop that mutually optimizes the learnable text prompt and the spectral visual encoder, preventing online prompt learning from disrupting the pre-trained text-image alignment distribution. In addition, to adapt to small-scale multi-spectral data and mitigate style differences between spectra, we propose a multi-spectral adapter that employs a low-rank adaptation method to learn spectra-specific features. Comprehensive experiments on five benchmarks, including RGBNT201, Market-MM, MSVR310, RGBN300, and RGBNT100, demonstrate that the proposed method outperforms state-of-the-art methods.
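For the low-rank adaptation component, a minimal LoRA-style sketch is shown below: a frozen linear layer is augmented with a small rank-r bottleneck so that a spectrum-specific correction can be learned cheaply. The rank, the zero-initialized up-projection, and the placement are illustrative assumptions, not the paper's adapter design.

```python
# Minimal low-rank adapter sketch (illustrative, not the multi-spectral adapter itself).
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # frozen pre-trained weight
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)               # adapter starts as an identity correction

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))  # frozen path + low-rank update

layer = LowRankAdapter(nn.Linear(768, 768))
print(layer(torch.randn(2, 196, 768)).shape)  # torch.Size([2, 196, 768])
```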
https://arxiv.org/abs/2505.17821
Software logs are messages recorded during the execution of a software system that provide crucial run-time information about events and activities. Although software logs play a critical role in software maintenance and operation tasks, publicly accessible log datasets remain limited, hindering advances in log analysis research and practice. The presence of sensitive information, particularly Personally Identifiable Information (PII) and quasi-identifiers, introduces serious privacy and re-identification risks, discouraging the publishing and sharing of real-world logs. In practice, log anonymization techniques primarily rely on regular expression patterns, which involve manually crafting rules to identify and replace sensitive information. However, these regex-based approaches suffer from significant limitations, such as extensive manual effort and poor generalizability across diverse log formats and datasets. To mitigate these limitations, we introduce SDLog, a deep learning-based framework designed to identify sensitive information in software logs. Our results show that SDLog overcomes regex limitations and outperforms the best-performing regex patterns in identifying sensitive information. With only 100 fine-tuning samples from the target dataset, SDLog can correctly identify 99.5% of sensitive attributes and achieves an F1-score of 98.4%. To the best of our knowledge, this is the first deep learning alternative to regex-based methods in software log anonymization.
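The learning-based formulation behind such a framework can be illustrated as token classification over log tokens, where each token is tagged as sensitive or not and tagged tokens are masked. The tiny BiLSTM tagger, stand-in tokenizer, and mask token below are illustrative assumptions, not the SDLog model.

```python
# Toy token-classification formulation of log anonymization (not SDLog itself).
import torch
import torch.nn as nn

class TokenTagger(nn.Module):
    def __init__(self, vocab_size=1000, dim=64, num_labels=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * dim, num_labels)    # label 1 = sensitive token (e.g. PII)

    def forward(self, token_ids):
        h, _ = self.rnn(self.emb(token_ids))
        return self.cls(h)                           # (B, L, num_labels) per-token logits

def anonymize(tokens, logits, mask_token="<MASKED>"):
    pred = logits.argmax(-1)
    return [mask_token if p == 1 else t for t, p in zip(tokens, pred[0].tolist())]

model = TokenTagger()
tokens = ["user", "alice@example.com", "logged", "in", "from", "10.0.0.7"]
ids = torch.randint(0, 1000, (1, len(tokens)))       # stand-in for a real log tokenizer
print(anonymize(tokens, model(ids)))                 # untrained model: output is arbitrary
```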
https://arxiv.org/abs/2505.14976
Documents revealing sensitive information about individuals must typically be de-identified. This de-identification is often done by masking all mentions of personally identifiable information (PII), thereby making it more difficult to uncover the identity of the person(s) in question. To investigate the robustness of de-identification methods, we present a novel, RAG-inspired approach that attempts the reverse process of re-identification based on a database of documents representing background knowledge. Given a text in which personal identifiers have been masked, the re-identification proceeds in two steps. A retriever first selects from the background knowledge passages deemed relevant for the re-identification. Those passages are then provided to an infilling model which seeks to infer the original content of each text span. This process is repeated until all masked spans are replaced. We evaluate the re-identification on three datasets (Wikipedia biographies, court rulings and clinical notes). Results show that (1) as many as 80% of de-identified text spans can be successfully recovered and (2) the re-identification accuracy increases along with the level of background knowledge.
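The two-step retrieve-then-infill loop can be sketched with the retriever and infilling model stubbed out as callables; the greedy one-span-at-a-time replacement and the [MASK] placeholder are assumptions about the interface rather than the paper's exact procedure.

```python
# Sketch of the retrieve-then-infill re-identification loop with stubbed components.
from typing import Callable, List

MASK = "[MASK]"

def reidentify(masked_text: str,
               retrieve: Callable[[str], List[str]],
               infill: Callable[[str, List[str]], str]) -> str:
    text = masked_text
    while MASK in text:
        passages = retrieve(text)              # step 1: select relevant background passages
        guess = infill(text, passages)         # step 2: infer the content of one masked span
        text = text.replace(MASK, guess, 1)    # replace that span and repeat
    return text

# toy stand-ins for the retriever and infilling model
demo = reidentify(
    "[MASK] was born in [MASK] and served as prime minister.",
    retrieve=lambda t: ["Background passage about a politician."],
    infill=lambda t, p: "UNKNOWN",
)
print(demo)
```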
https://arxiv.org/abs/2505.12859
This work focuses on Clothes Changing Re-IDentification (CC-ReID) for the real world. Existing works perform well with high-quality (HQ) images, but struggle with low-quality (LQ) images, which can contain artifacts like pixelation, out-of-focus blur, and motion blur. These artifacts not only introduce noise into external biometric attributes (e.g., pose, body shape) but also corrupt the model's internal feature representation. Models usually cluster LQ image features together, making it difficult to distinguish between them and leading to incorrect matches. We propose a novel framework, Robustness against Low-Quality (RLQ), to improve CC-ReID models on real-world data. RLQ relies on Coarse Attributes Prediction (CAP) and Task Agnostic Distillation (TAD) operating in alternating steps in a novel training mechanism. CAP enriches the model with external fine-grained attributes via coarse predictions, thereby reducing the effect of noisy inputs. On the other hand, TAD enhances the model's internal feature representation by bridging the gap between HQ and LQ features via an external dataset through task-agnostic self-supervision and distillation. RLQ outperforms existing approaches by 1.6%-2.9% Top-1 on real-world datasets like LaST and DeepChange, while showing a consistent improvement of 5.3%-6% Top-1 on PRCC with competitive performance on LTCC. *The code will be made public soon.*
https://arxiv.org/abs/2505.12580
Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed through grounding of characters, objects, and other entities on the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. Evaluation demonstrates a reduction from 4.06 to 3.56 (-12.3%) hallucinations on average per story when compared to a non-fine-tuned model.
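The cross-frame re-identification step can be sketched as embedding matching: a detection is linked to an existing story entity when its visual (or face) embedding is sufficiently similar, and otherwise registers a new entity. The cosine threshold and greedy assignment below are illustrative simplifications of such a scheme.

```python
# Illustrative cross-frame entity assignment by embedding similarity (simplified).
import torch
import torch.nn.functional as F

def assign_entity_ids(new_embs, entity_bank, threshold=0.8):
    """new_embs: (N, D) detection embeddings; entity_bank: dict id -> (D,) reference embedding."""
    assignments = []
    for emb in F.normalize(new_embs, dim=-1):
        best_id, best_sim = None, threshold
        for eid, ref in entity_bank.items():
            sim = float(emb @ F.normalize(ref, dim=-1))
            if sim > best_sim:
                best_id, best_sim = eid, sim
        if best_id is None:                      # unseen object: register a new entity id
            best_id = len(entity_bank)
            entity_bank[best_id] = emb.clone()
        assignments.append(best_id)
    return assignments

bank = {0: torch.randn(256)}
print(assign_entity_ids(torch.randn(3, 256), bank))
```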
https://arxiv.org/abs/2505.10292
In this paper, we propose a novel attention module termed the Differentiable Channel Selection Attention module, or the DCS-Attention module. In contrast with conventional self-attention, the DCS-Attention module features selection of informative channels in the computation of the attention weights. The selection of the feature channels is performed in a differentiable manner, enabling seamless integration with DNN training. Our DCS-Attention is compatible with either fixed neural network backbones or learnable backbones with Differentiable Neural Architecture Search (DNAS), leading to DCS with Fixed Backbone (DCS-FB) and DCS-DNAS, respectively. Importantly, our DCS-Attention is motivated by the principle of Information Bottleneck (IB), and a novel variational upper bound for the IB loss, which can be optimized by SGD, is derived and incorporated into the training loss of the networks with the DCS-Attention modules. In this manner, a neural network with DCS-Attention modules is capable of selecting the most informative channels for feature extraction so that it enjoys state-of-the-art performance for the Re-ID task. Extensive experiments on multiple person Re-ID benchmarks using both DCS-FB and DCS-DNAS show that DCS-Attention significantly enhances the prediction accuracy of DNNs for person Re-ID, which demonstrates the effectiveness of DCS-Attention in learning discriminative features critical to identifying person identities. The code of our work is available at this https URL.
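As a rough illustration of selecting channels inside the attention computation, the sketch below applies a learnable, differentiable (sigmoid) channel gate to the queries and keys before the attention weights are formed. The gating form is a simplification; the paper's IB-motivated variational objective is not reproduced here.

```python
# Simplified differentiable channel-gated self-attention (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedChannelSelfAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Parameter(torch.zeros(dim))    # one selection logit per channel
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                              # x: (B, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        g = torch.sigmoid(self.gate)                   # soft, differentiable channel mask
        attn = F.softmax((q * g) @ (k * g).transpose(1, 2) * self.scale, dim=-1)
        return self.proj(attn @ v)

x = torch.randn(2, 50, 256)
print(GatedChannelSelfAttention()(x).shape)  # torch.Size([2, 50, 256])
```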
https://arxiv.org/abs/2505.08961
Tracking a target person from robot-egocentric views is crucial for developing autonomous robots that provide continuous personalized assistance or collaboration in Human-Robot Interaction (HRI) and Embodied AI. However, most existing target person tracking (TPT) benchmarks are limited to controlled laboratory environments with few distractions, clean backgrounds, and short-term occlusions. In this paper, we introduce a large-scale dataset designed for TPT in crowded and unstructured environments, demonstrated through a robot-person following task. The dataset is collected by a human pushing a sensor-equipped cart while following a target person, capturing human-like following behavior and emphasizing long-term tracking challenges, including frequent occlusions and the need for re-identification from numerous pedestrians. It includes multi-modal data streams, including odometry, 3D LiDAR, IMU, panoptic, and RGB-D images, along with exhaustively annotated 2D bounding boxes of the target person across 35 sequences, both indoors and outdoors. Using this dataset and visual annotations, we perform extensive experiments with existing TPT methods, offering a thorough analysis of their limitations and suggesting future research directions.
https://arxiv.org/abs/2505.07446