Software logs are messages recorded during the execution of a software system that provide crucial run-time information about events and activities. Although software logs play a critical role in software maintenance and operation tasks, publicly accessible log datasets remain limited, hindering advances in log analysis research and practice. The presence of sensitive information, particularly Personally Identifiable Information (PII) and quasi-identifiers, introduces serious privacy and re-identification risks, discouraging the publishing and sharing of real-world logs. In practice, log anonymization techniques rely primarily on regular expression patterns, which involve manually crafting rules to identify and replace sensitive information. However, these regex-based approaches suffer from significant limitations, such as extensive manual effort and poor generalizability across diverse log formats and datasets. To mitigate these limitations, we introduce SDLog, a deep learning-based framework designed to identify sensitive information in software logs. Our results show that SDLog overcomes regex limitations and outperforms the best-performing regex patterns in identifying sensitive information. With only 100 fine-tuning samples from the target dataset, SDLog can correctly identify 99.5% of sensitive attributes and achieves an F1-score of 98.4%. To the best of our knowledge, this is the first deep learning alternative to regex-based methods in software log anonymization.
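For context, here is a minimal sketch of the regex-based anonymization baseline the abstract criticizes; the rules and log line are illustrative assumptions, not the authors' code. A learned detector such as SDLog would presumably tag sensitive spans with a fine-tuned model instead of hand-written patterns like these.

```python
import re

# Hand-crafted rules: the regex baseline discussed above. Every new log format
# typically needs new rules, which is the manual effort and poor generalizability
# the paper points to. Rules and the example line are purely illustrative.
REGEX_RULES = {
    "IP":    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def regex_anonymize(line: str) -> str:
    for label, pattern in REGEX_RULES.items():
        line = pattern.sub(f"<{label}>", line)
    return line

print(regex_anonymize("Accepted connection from 10.0.0.3 for user bob@example.com"))
# -> Accepted connection from <IP> for user <EMAIL>
```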
https://arxiv.org/abs/2505.14976
Documents revealing sensitive information about individuals must typically be de-identified. This de-identification is often done by masking all mentions of personally identifiable information (PII), thereby making it more difficult to uncover the identity of the person(s) in question. To investigate the robustness of de-identification methods, we present a novel, RAG-inspired approach that attempts the reverse process of re-identification based on a database of documents representing background knowledge. Given a text in which personal identifiers have been masked, the re-identification proceeds in two steps. A retriever first selects, from the background knowledge, passages deemed relevant to the re-identification. Those passages are then provided to an infilling model, which seeks to infer the original content of each masked span. This process is repeated until all masked spans are replaced. We evaluate the re-identification on three datasets (Wikipedia biographies, court rulings, and clinical notes). Results show that (1) as many as 80% of de-identified text spans can be successfully recovered and (2) the re-identification accuracy increases along with the level of background knowledge.
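A toy sketch of the two-step loop described above, with TF-IDF standing in for the retriever and a stub standing in for the infilling model; the background passages, query, and function names are illustrative assumptions, not the paper's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Background knowledge available to the adversary (illustrative).
background = [
    "Jane Roe, born in 1984, practised law in Oslo before moving abroad.",
    "The defendant, Jane Roe, was sentenced by the Oslo district court.",
    "An unrelated biography about a painter from Lisbon.",
]

def retrieve(masked_text: str, k: int = 2) -> list[str]:
    """Step 1: select the background passages most relevant to the masked text."""
    vectorizer = TfidfVectorizer().fit(background + [masked_text])
    sims = cosine_similarity(vectorizer.transform([masked_text]),
                             vectorizer.transform(background))[0]
    return [background[i] for i in sims.argsort()[::-1][:k]]

def infill(masked_text: str, passages: list[str]) -> str:
    """Step 2 (stub): a real system would prompt an infilling model with the
    retrieved passages to guess the content of each masked span."""
    return masked_text.replace("[MASK]", "Jane Roe", 1)  # placeholder guess

masked = "[MASK], a lawyer from Oslo, was sentenced on appeal."
print(infill(masked, retrieve(masked)))
```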
https://arxiv.org/abs/2505.12859
This work focuses on Clothes Changing Re-IDentification (CC-ReID) for the real world. Existing works perform well with high-quality (HQ) images but struggle with low-quality (LQ) images, which can contain artifacts like pixelation, out-of-focus blur, and motion blur. These artifacts not only introduce noise into external biometric attributes (e.g. pose, body shape) but also corrupt the model's internal feature representation. Models usually cluster LQ image features together, making it difficult to distinguish between them and leading to incorrect matches. We propose a novel framework, Robustness against Low-Quality (RLQ), to improve CC-ReID models on real-world data. RLQ relies on Coarse Attributes Prediction (CAP) and Task Agnostic Distillation (TAD) operating in alternating steps in a novel training mechanism. CAP enriches the model with external fine-grained attributes via coarse predictions, thereby reducing the effect of noisy inputs. On the other hand, TAD enhances the model's internal feature representation by bridging the gap between HQ and LQ features via an external dataset through task-agnostic self-supervision and distillation. RLQ outperforms the existing approaches by 1.6%-2.9% Top-1 on real-world datasets like LaST and DeepChange, while showing a consistent improvement of 5.3%-6% Top-1 on PRCC with competitive performance on LTCC. *The code will be made public soon.*
https://arxiv.org/abs/2505.12580
Visual storytelling systems struggle to maintain character identity across frames and to link actions to the appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed by grounding characters, objects, and other entities in the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. Evaluation demonstrates a reduction in hallucinations per story from an average of 4.06 to 3.56 (-12.3%) compared to a non-fine-tuned model.
https://arxiv.org/abs/2505.10292
In this paper, we propose a novel attention module termed the Differentiable Channel Selection Attention module, or the DCS-Attention module. In contrast with conventional self-attention, the DCS-Attention module features selection of informative channels in the computation of the attention weights. The selection of the feature channels is performed in a differentiable manner, enabling seamless integration with DNN training. Our DCS-Attention is compatible with either fixed neural network backbones or learnable backbones with Differentiable Neural Architecture Search (DNAS), leading to DCS with Fixed Backbone (DCS-FB) and DCS-DNAS, respectively. Importantly, our DCS-Attention is motivated by the principle of Information Bottleneck (IB), and a novel variational upper bound for the IB loss, which can be optimized by SGD, is derived and incorporated into the training loss of the networks with the DCS-Attention modules. In this manner, a neural network with DCS-Attention modules is capable of selecting the most informative channels for feature extraction so that it enjoys state-of-the-art performance for the Re-ID task. Extensive experiments on multiple person Re-ID benchmarks using both DCS-FB and DCS-DNAS show that DCS-Attention significantly enhances the prediction accuracy of DNNs for person Re-ID, which demonstrates the effectiveness of DCS-Attention in learning discriminative features critical to identifying person identities. The code of our work is available at this https URL.
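The following PyTorch fragment is a minimal sketch of the general idea of differentiable channel selection inside an attention block: a soft gate, learned end-to-end, weights feature channels before attention is computed. The gating form, dimensions, and omission of the IB-based loss are assumptions for illustration, not the authors' DCS-Attention module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelectAttention(nn.Module):
    """Soft, differentiable channel gating before self-attention (illustrative)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # per-channel weights
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, tokens, dim)
        g = self.gate(x.mean(dim=1))           # (batch, dim) channel importance
        x = x * g.unsqueeze(1)                 # gated features, still differentiable
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.proj(attn @ v)

feats = torch.randn(2, 16, 64)
print(ChannelSelectAttention(64)(feats).shape)  # torch.Size([2, 16, 64])
```

Because the gate is a differentiable function of the features, it can be trained jointly with the rest of the network by SGD, which is the property the abstract emphasizes.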
https://arxiv.org/abs/2505.08961
Tracking a target person from robot-egocentric views is crucial for developing autonomous robots that provide continuous personalized assistance or collaboration in Human-Robot Interaction (HRI) and Embodied AI. However, most existing target person tracking (TPT) benchmarks are limited to controlled laboratory environments with few distractions, clean backgrounds, and short-term occlusions. In this paper, we introduce a large-scale dataset designed for TPT in crowded and unstructured environments, demonstrated through a robot-person following task. The dataset is collected by a human pushing a sensor-equipped cart while following a target person, capturing human-like following behavior and emphasizing long-term tracking challenges, including frequent occlusions and the need for re-identification among numerous pedestrians. It comprises multi-modal data streams, including odometry, 3D LiDAR, IMU, panoptic, and RGB-D images, along with exhaustively annotated 2D bounding boxes of the target person across 35 sequences, both indoors and outdoors. Using this dataset and visual annotations, we perform extensive experiments with existing TPT methods, offering a thorough analysis of their limitations and suggesting future research directions.
https://arxiv.org/abs/2505.07446
Gait recognition, known for its ability to identify individuals from a distance, has gained significant attention in recent times due to its non-intrusive verification. While video-based gait identification systems perform well on large public datasets, their performance drops when applied to real-world, unconstrained gait data due to various factors. Among these, uncontrolled outdoor environments, non-overlapping camera views, varying illumination, and computational efficiency are core challenges in gait-based authentication. Currently, no dataset addresses all of these challenges simultaneously. In this paper, we propose an OptiGait-LGBM model capable of performing person re-identification under these constraints using a skeletal model approach, which helps mitigate inconsistencies in a person's appearance. The model constructs a dataset from landmark positions, minimizing memory usage by using non-sequential data. A benchmark dataset, RUET-GAIT, is introduced to represent uncontrolled gait sequences in complex outdoor environments. The process involves extracting skeletal joint landmarks, generating numerical datasets, and developing the OptiGait-LGBM gait classification model. Our aim is to address the aforementioned challenges with minimal computational cost compared to existing methods. A comparative analysis with ensemble techniques such as Random Forest and CatBoost demonstrates that the proposed approach outperforms them in terms of accuracy, memory usage, and training time. This method provides a novel, low-cost, and memory-efficient video-based gait recognition solution for real-world scenarios.
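A rough sketch of the classification stage described above: frame-level landmark coordinates are treated as tabular features for a LightGBM classifier. The random features, label count, and hyperparameters below are placeholders; the paper's actual landmark extraction and dataset construction are not reproduced here.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

# Random numbers stand in for skeletal joint landmarks
# (e.g. 33 joints x (x, y) coordinates per frame).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 66))          # 600 frames, 66 landmark coordinates
y = rng.integers(0, 5, size=600)        # 5 hypothetical subject identities

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LGBMClassifier(n_estimators=200, num_leaves=31)
clf.fit(X_tr, y_tr)
print("frame-level accuracy:", clf.score(X_te, y_te))
```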
https://arxiv.org/abs/2505.08801
Video surveillance image analysis and processing is a challenging field in computer vision, and one of its most difficult tasks is Person Re-Identification (PRe-ID). PRe-ID aims to identify and track target individuals who have already been detected in a network of cameras, using a robust description of their pedestrian images. The success of recent research in PRe-ID is largely due to effective feature extraction and representation, as well as the powerful learning of these features to reliably discriminate between pedestrian images. To this end, two powerful features, Convolutional Neural Network (CNN) and Local Maximal Occurrence (LOMO) features, are modeled on multidimensional data using the proposed method, High-Dimensional Feature Fusion (HDFF). Specifically, a new tensor fusion scheme is introduced to leverage and combine these two types of features in a single tensor, even though their dimensions are not identical. To enhance the system's accuracy, we employ Tensor Cross-View Quadratic Analysis (TXQDA) for multilinear subspace learning, followed by cosine similarity for matching. TXQDA efficiently facilitates learning while reducing the high dimensionality inherent in high-order tensor data. The effectiveness of our approach is verified through experiments on three widely used PRe-ID datasets: VIPeR, GRID, and PRID450S. Extensive experiments demonstrate that our approach outperforms recent state-of-the-art methods.
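A small sketch of the final matching step mentioned above: after features are projected into the learned subspace, a probe is ranked against the gallery by cosine similarity. The random vectors stand in for projected features; the tensor fusion and TXQDA steps themselves are not shown.

```python
import numpy as np

def rank_gallery(probe: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Rank gallery entries for one probe by cosine similarity (best first)."""
    probe = probe / np.linalg.norm(probe)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(gallery @ probe)[::-1]

rng = np.random.default_rng(0)
print(rank_gallery(rng.normal(size=64), rng.normal(size=(10, 64))))
```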
https://arxiv.org/abs/2505.15825
Text-based pedestrian search (TBPS) in full images aims to locate a target pedestrian in untrimmed images using natural language descriptions. However, in complex scenes with multiple pedestrians, existing methods are limited by uncertainties in detection and matching, leading to degraded performance. To address this, we propose UPD-TBPS, a novel framework comprising three modules: Multi-granularity Uncertainty Estimation (MUE), Prototype-based Uncertainty Decoupling (PUD), and Cross-modal Re-identification (ReID). MUE conducts multi-granularity queries to identify potential targets and assigns confidence scores to reduce early-stage uncertainty. PUD leverages visual context decoupling and prototype mining to extract features of the target pedestrian described in the query. It separates and learns pedestrian prototype representations at both the coarse-grained cluster level and the fine-grained individual level, thereby reducing matching uncertainty. ReID evaluates candidates with varying confidence levels, improving detection and retrieval accuracy. Experiments on CUHK-SYSU-TBPS and PRW-TBPS datasets validate the effectiveness of our framework.
https://arxiv.org/abs/2505.03567
Visible-infrared person re-identification (VIReID) provides a solution for ReID tasks in 24-hour scenarios; however, significant challenges persist in achieving satisfactory performance due to the substantial discrepancies between the visible (VIS) and infrared (IR) modalities. Existing methods inadequately leverage information from different modalities, primarily focusing on mining distinguishing features from modality-shared information while neglecting modality-specific details. To fully utilize differentiated minutiae, we propose a Base-Detail Feature Learning Framework (BDLF) that enhances the learning of both base and detail knowledge, thereby capitalizing on both modality-shared and modality-specific information. Specifically, the proposed BDLF mines detail and base features through a lossless detail feature extraction module and a complementary base embedding generation mechanism, respectively, supported by a novel correlation restriction method that ensures the features gained by BDLF enrich both detail and base knowledge across VIS and IR features. Comprehensive experiments conducted on the SYSU-MM01, RegDB, and LLCM datasets validate the effectiveness of BDLF.
https://arxiv.org/abs/2505.03286
Unsupervised visible-infrared person re-identification (UVI-ReID) aims to retrieve pedestrian images across different modalities without costly annotations, but faces challenges due to the modality gap and lack of supervision. Existing methods often adopt self-training with clustering-generated pseudo-labels but implicitly assume these labels are always correct. In practice, however, this assumption fails due to inevitable pseudo-label noise, which hinders model learning. To address this, we introduce a new learning paradigm that explicitly considers Pseudo-Label Noise (PLN), characterized by three key challenges: noise overfitting, error accumulation, and noisy cluster correspondence. To this end, we propose a novel Robust Duality Learning framework (RoDE) for UVI-ReID to mitigate the effects of noisy pseudo-labels. First, to combat noise overfitting, a Robust Adaptive Learning mechanism (RAL) is proposed to dynamically emphasize clean samples while down-weighting noisy ones. Second, to alleviate error accumulation-where the model reinforces its own mistakes-RoDE employs dual distinct models that are alternately trained using pseudo-labels from each other, encouraging diversity and preventing collapse. However, this dual-model strategy introduces misalignment between clusters across models and modalities, creating noisy cluster correspondence. To resolve this, we introduce Cluster Consistency Matching (CCM), which aligns clusters across models and modalities by measuring cross-cluster similarity. Extensive experiments on three benchmarks demonstrate the effectiveness of RoDE.
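A minimal sketch, assuming a softmax-over-negative-loss form, of how a robust adaptive weighting such as RAL could emphasize likely-clean samples and down-weight likely-noisy ones within a batch; the exact weighting used in the paper may differ.

```python
import torch

def adaptive_sample_weights(per_sample_loss: torch.Tensor, temperature: float = 1.0):
    """Low-loss (likely clean) samples get large weights, high-loss (likely
    noisy) samples get small ones; weights average to 1 over the batch."""
    w = torch.softmax(-per_sample_loss / temperature, dim=0)
    return w * per_sample_loss.numel()

losses = torch.tensor([0.2, 0.3, 2.5, 0.25])   # one pseudo-labelled batch
print(adaptive_sample_weights(losses))          # the noisy sample gets a small weight
```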
https://arxiv.org/abs/2505.02549
Face detection and face recognition have been a focus of the vision community since its very beginnings. Inspired by the success of the original Videoface digitizer, a pioneering device that allowed users to capture video signals from any source, we have designed an advanced video analytics tool to efficiently create structured video stories, i.e. identity-based information catalogs. VideoFace2.0 is the name of the developed system for spatial and temporal localization of each unique face in the input video, i.e. face re-identification (ReID); it also allows cataloging and characterizing faces and creating structured video outputs for later downstream tasks. The developed near real-time solution is primarily designed for application scenarios involving TV production and media analysis, and as an efficient tool for creating the large video datasets necessary for training machine learning (ML) models on challenging vision tasks such as lip reading and multimodal speech recognition. The conducted experiments confirm the applicability of the proposed face ReID algorithm, which combines the concepts of face detection, face recognition, and passive tracking-by-detection to achieve robust and efficient face ReID. The system is envisioned as a compact and modular extension of existing video production equipment. We hope that the presented work and shared code will stimulate further interest in the development of similar, application-specific video analysis tools and lower the entry barrier for the production of high-quality multi-modal ML datasets in the future.
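The core cataloging logic described above can be pictured as matching each new face embedding against a gallery of already-seen identities; the sketch below uses cosine similarity, an assumed threshold, and random embeddings purely for illustration and is not the VideoFace2.0 implementation.

```python
import numpy as np

gallery: list[np.ndarray] = []           # one normalized embedding per catalog identity

def assign_identity(face_emb: np.ndarray, threshold: float = 0.6) -> int:
    """Return an existing catalog index if the face matches, else open a new entry."""
    face_emb = face_emb / np.linalg.norm(face_emb)
    for idx, ref in enumerate(gallery):
        if float(face_emb @ ref) > threshold:
            return idx                    # re-identified as an already-seen person
    gallery.append(face_emb)              # unseen person: new catalog entry
    return len(gallery) - 1

rng = np.random.default_rng(1)
for emb in rng.normal(size=(5, 128)):     # stand-ins for face-recognition embeddings
    print("identity:", assign_identity(emb))
```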
https://arxiv.org/abs/2505.02060
Online multi-object tracking has been recently dominated by tracking-by-detection (TbD) methods, where recent advances rely on increasingly sophisticated heuristics for tracklet representation, feature fusion, and multi-stage matching. The key strength of TbD lies in its modular design, enabling the integration of specialized off-the-shelf models like motion predictors and re-identification. However, the extensive usage of human-crafted rules for temporal associations makes these methods inherently limited in their ability to capture the complex interplay between various tracking cues. In this work, we introduce CAMEL, a novel association module for Context-Aware Multi-Cue ExpLoitation, that learns resilient association strategies directly from data, breaking free from hand-crafted heuristics while maintaining TbD's valuable modularity. At its core, CAMEL employs two transformer-based modules and relies on a novel association-centric training scheme to effectively model the complex interactions between tracked targets and their various association cues. Unlike end-to-end detection-by-tracking approaches, our method remains lightweight and fast to train while being able to leverage external off-the-shelf models. Our proposed online tracking pipeline, CAMELTrack, achieves state-of-the-art performance on multiple tracking benchmarks. Our code is available at this https URL.
https://arxiv.org/abs/2505.01257
Practical applications of computer vision in smart cities usually assume system integration and operation in challenging open-world environments. In the case of the person re-identification task, the main goal is to determine whether a specific person has appeared in another place at a different time in the same video, or across multiple camera feeds. This typically assumes collecting raw data from video surveillance cameras in different places and under varying illumination conditions. In the considered open-world setting, it also requires detecting and localizing the person inside the analyzed video frame before the main re-identification step. With multi-person and multi-camera setups, system complexity becomes higher, requiring sophisticated tracking solutions and re-identification models. In this work we will discuss existing challenges in system design architectures, consider possible solutions based on different computer vision techniques, and describe applications of such systems in retail stores and public spaces for improved marketing analytics. In order to analyse the sensitivity of the person re-identification task under different open-world environments, the performance of a close-to-real-time solution will be demonstrated over several video captures and live camera feeds. Finally, based on the conducted experiments, we will indicate further research directions and possible system improvements.
https://arxiv.org/abs/2505.00772
Visible-Infrared Person Re-Identification (VI-ReID) is a challenging task due to the large modality discrepancy between visible and infrared images, which complicates the alignment of their features into a suitable common space. Moreover, style noise, such as illumination and color contrast, reduces the identity discriminability and modality invariance of features. To address these challenges, we propose a novel Diverse Semantics-guided Feature Alignment and Decoupling (DSFAD) network to align identity-relevant features from different modalities into a textual embedding space and disentangle identity-irrelevant features within each modality. Specifically, we develop a Diverse Semantics-guided Feature Alignment (DSFA) module, which generates pedestrian descriptions with diverse sentence structures to guide the cross-modality alignment of visual features. Furthermore, to filter out style information, we propose a Semantic Margin-guided Feature Decoupling (SMFD) module, which decomposes visual features into pedestrian-related and style-related components, and then constrains the similarity between the former and the textual embeddings to be at least a margin higher than that between the latter and the textual embeddings. Additionally, to prevent the loss of pedestrian semantics during feature decoupling, we design a Semantic Consistency-guided Feature Restitution (SCFR) module, which further excavates useful information for identification from the style-related features and restores it back into the pedestrian-related features, and then constrains the similarity between the features after restitution and the textual embeddings to be consistent with that between the features before decoupling and the textual embeddings. Extensive experiments on three VI-ReID datasets demonstrate the superiority of our DSFAD.
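The SMFD constraint described above can be written as a standard margin loss: the similarity between the pedestrian-related features and the textual embedding must exceed the similarity between the style-related features and the textual embedding by at least a margin. A minimal PyTorch sketch, with assumed shapes and margin value, follows.

```python
import torch
import torch.nn.functional as F

def semantic_margin_loss(ped_feat, style_feat, text_emb, margin: float = 0.2):
    """Hinge penalty that is zero once the pedestrian-related features are at
    least `margin` more similar to the text embedding than the style-related
    features are. Feature dimensions and the margin value are assumptions."""
    sim_ped = F.cosine_similarity(ped_feat, text_emb, dim=-1)
    sim_style = F.cosine_similarity(style_feat, text_emb, dim=-1)
    return F.relu(sim_style - sim_ped + margin).mean()

ped, style, text = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
print(semantic_margin_loss(ped, style, text))
```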
https://arxiv.org/abs/2505.00619
Vision sensors are becoming more important in Intelligent Transportation Systems (ITS) for traffic monitoring, management, and optimization as the number of network cameras continues to rise. However, manual object tracking and matching across multiple non-overlapping cameras pose significant challenges in city-scale urban traffic scenarios. These challenges include handling diverse vehicle attributes, occlusions, illumination variations, shadows, and varying video resolutions. To address these issues, we propose an efficient and cost-effective deep learning-based framework for Multi-Object Multi-Camera Tracking (MO-MCT). The proposed framework utilizes Mask R-CNN for object detection and employs Non-Maximum Suppression (NMS) to select target objects from overlapping detections. Transfer learning is employed for re-identification, enabling the association and generation of vehicle tracklets across multiple cameras. Moreover, we leverage appropriate loss functions and distance measures to handle occlusion, illumination, and shadow challenges. The final solution identification module performs feature extraction using ResNet-152 coupled with Deep SORT based vehicle tracking. The proposed framework is evaluated on the 5th AI City Challenge dataset (Track 3), comprising 46 camera feeds. Among these 46 camera streams, 40 are used for model training and validation, while the remaining six are utilized for model testing. The proposed framework achieves competitive performance with an IDF1 score of 0.8289, and precision and recall scores of 0.9026 and 0.8527 respectively, demonstrating its effectiveness in robust and accurate vehicle tracking.
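As a concrete illustration of one step in the pipeline, below is a plain NumPy version of non-maximum suppression, the operation used above to keep a single detection per vehicle from overlapping Mask R-CNN outputs; the boxes and threshold are made up for the example.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thr: float = 0.5) -> list[int]:
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thr]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))   # -> [0, 2]
```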
https://arxiv.org/abs/2505.00534
Sanitizing sensitive text data typically involves removing personally identifiable information (PII) or generating synthetic data under the assumption that these methods adequately protect privacy; however, their effectiveness is often assessed only by measuring the leakage of explicit identifiers, ignoring nuanced textual markers that can lead to re-identification. We challenge the above illusion of privacy by proposing a new framework that evaluates re-identification attacks to quantify individual privacy risks upon data release. Our approach shows that seemingly innocuous auxiliary information -- such as routine social activities -- can be used to infer sensitive attributes like age or substance use history from sanitized data. For instance, we demonstrate that Azure's commercial PII removal tool fails to protect 74% of information in the MedQA dataset. Although differential privacy mitigates these risks to some extent, it significantly reduces the utility of the sanitized text for downstream tasks. Our findings indicate that current sanitization techniques offer a *false sense of privacy*, highlighting the need for more robust methods that protect against semantic-level information leakage.
https://arxiv.org/abs/2504.21035
Unsupervised visible-infrared person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning. Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design a contrastive learning framework for global feature learning. However, these methods overlook the cross-modality variations in feature representation and pseudo-label distributions brought by fine-grained patterns. This oversight results in insufficient modality-shared learning when only global features are optimized. To address this issue, we propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds up an optimization objective for the specific fine-grained patterns emphasized by each modality, thereby achieving complementary alignment between the label distributions of different modalities. Specifically, we first introduce a Dual Association with Global Learning (DAGI) module to unify the pseudo-labels of cross-modality instances in a bi-directional manner. Afterward, a Fine-Grained Semantic-Aligned Learning (FGSAL) module is carried out to explore part-level semantic-aligned patterns emphasized by each modality from cross-modality instances. An optimization objective is then formulated based on the semantic-aligned features and their corresponding label space. To alleviate the side effects arising from noisy pseudo-labels, we propose a Global-Part Collaborative Refinement (GPCR) module to dynamically mine reliable positive sample sets for the global and part features and to optimize inter-instance relationships. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves superior performance to state-of-the-art methods. Our code is available at this https URL.
https://arxiv.org/abs/2504.19244
This contribution explores the impact of synthetic training data usage and the prediction of material wear and aging in the context of re-identification. Different experimental setups and gallery set expanding strategies are tested, analyzing their impact on performance over time for aging re-identification subjects. Using a continuously updating gallery, we were able to increase our mean Rank-1 accuracy by 24%, as material aging was taken into account step by step. In addition, using models trained with 10% artificial training data, Rank-1 accuracy could be increased by up to 13%, in comparison to a model trained on only real-world data, significantly boosting generalized performance on hold-out data. Finally, this work introduces a novel, open-source re-identification dataset, pallet-block-2696. This dataset contains 2,696 images of Euro pallets, taken over a period of 4 months. During this time, natural aging processes occurred and some of the pallets were damaged during their usage. These wear and tear processes significantly changed the appearance of the pallets, providing a dataset that can be used to generate synthetically aged pallets or other wooden materials.
https://arxiv.org/abs/2504.18286
Accurate and reliable multi-object tracking (MOT) in 3D space is essential for advancing robotics and computer vision applications. However, it remains a significant challenge in monocular setups due to the difficulty of mining 3D spatiotemporal associations from 2D video streams. In this work, we present three innovative techniques to enhance the fusion and exploitation of heterogeneous cues for monocular 3D MOT: (1) we introduce the Hungarian State Space Model (HSSM), a novel data association mechanism that compresses contextual tracking cues across multiple paths, enabling efficient and comprehensive assignment decisions with linear complexity. HSSM features a global receptive field and dynamic weights, in contrast to traditional linear assignment algorithms that rely on hand-crafted association costs. (2) We propose Fully Convolutional One-stage Embedding (FCOE), which eliminates ROI pooling by directly using dense feature maps for contrastive learning, thus improving object re-identification accuracy under challenging conditions such as varying viewpoints and lighting. (3) We enhance 6-DoF pose estimation through VeloSSM, an encoder-decoder architecture that models temporal dependencies in velocity to capture motion dynamics, overcoming the limitations of frame-based 3D inference. Experiments on the KITTI public test benchmark demonstrate the effectiveness of our method, achieving a new state-of-the-art performance of 76.86 HOTA at 31 FPS. Our approach outperforms the previous best by significant margins of +2.63 HOTA and +3.62 AssA, showcasing its robustness and efficiency for monocular 3D MOT tasks. The code and models are available at this https URL.
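A minimal sketch of contrastive learning over dense embeddings in the spirit of FCOE: embeddings of the same object taken from different frames are pulled together while all other pairs in the batch are pushed apart via an InfoNCE loss. Shapes, temperature, and the pairing scheme are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dense_infonce(anchor, positive, temperature: float = 0.07):
    """InfoNCE over paired embeddings: row i of `anchor` and row i of `positive`
    are views of the same object; every other row is treated as a negative."""
    a = F.normalize(anchor, dim=-1)          # (N, D) embeddings from one frame
    p = F.normalize(positive, dim=-1)        # (N, D) embeddings from another frame
    logits = a @ p.t() / temperature
    targets = torch.arange(a.shape[0])
    return F.cross_entropy(logits, targets)

print(dense_infonce(torch.randn(8, 128), torch.randn(8, 128)))
```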
https://arxiv.org/abs/2504.18068