The multi-camera vehicle tracking (MCVT) framework holds significant potential for smart city applications, including anomaly detection, traffic density estimation, and suspect vehicle tracking. However, current publicly available datasets exhibit limitations, such as overly simplistic scenarios, low-resolution footage, and insufficiently diverse conditions, creating a considerable gap between academic research and real-world scenarios. To fill this gap, we introduce RoundaboutHD, a comprehensive, high-resolution multi-camera vehicle tracking benchmark dataset specifically designed to represent real-world roundabout scenarios. RoundaboutHD provides a total of 40 minutes of labelled video footage captured by four non-overlapping, high-resolution (4K, 15 fps) cameras. In total, 512 unique vehicle identities are annotated across the different camera views, offering rich cross-camera association data. RoundaboutHD offers temporally consistent video footage and enhanced challenges, including increased occlusions and nonlinear movement inside the roundabout. In addition to the full MCVT dataset, several subsets are also available for object detection, single-camera tracking, and image-based vehicle re-identification (ReID) tasks. Vehicle model information and camera modelling/geometry information are also included to support further analysis. We provide baseline results for vehicle detection, single-camera tracking, image-based vehicle re-identification, and multi-camera tracking. The dataset and the evaluation code are publicly available at: this https URL
https://arxiv.org/abs/2507.08729
It is common knowledge that constructing image datasets usually depends on time-intensive and inefficient manual collection and annotation. Large models offer a solution via data generation. Nonetheless, real-world data are clearly more valuable than artificially generated data, particularly for constructing image datasets. For this reason, we propose a novel method for automatically constructing datasets from real-world images with a multi-agent collaborative system, named DatasetAgent. By coordinating four agents equipped with Multi-modal Large Language Models (MLLMs), together with a tool package for image optimization, DatasetAgent is able to construct high-quality image datasets according to user-specified requirements. In particular, two types of experiments are conducted on a variety of open-source datasets: expanding existing datasets and creating new ones from scratch. In both cases, multiple image datasets constructed by DatasetAgent are used to train various vision models for image classification, object detection, and image segmentation.
https://arxiv.org/abs/2507.08648
Multi-view camera-based 3D perception can be conducted using bird's eye view (BEV) features obtained through perspective view-to-BEV transformations. Several studies have shown that the performance of these 3D perception methods can be further enhanced by combining sequential BEV features obtained from multiple camera frames. However, even after compensating for the ego-motion of an autonomous agent, the performance gain from temporal aggregation is limited when combining a large number of image frames. This limitation arises due to dynamic changes in BEV features over time caused by object motion. In this paper, we introduce a novel temporal 3D perception method called OnlineBEV, which combines BEV features over time using a recurrent structure. This structure increases the effective number of combined features with minimal memory usage. However, it is critical to spatially align the features over time to maintain strong performance. OnlineBEV employs the Motion-guided BEV Fusion Network (MBFNet) to achieve temporal feature alignment. MBFNet extracts motion features from consecutive BEV frames and dynamically aligns historical BEV features with current ones using these motion features. To enforce temporal feature alignment explicitly, we use a Temporal Consistency Learning Loss, which captures discrepancies between historical and target BEV features. Experiments conducted on the nuScenes benchmark demonstrate that OnlineBEV achieves significant performance gains over the current best method, SOLOFusion. OnlineBEV achieves 63.9% NDS on the nuScenes test set, recording state-of-the-art performance in the camera-only 3D object detection task.
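As a rough illustration of the recurrent, motion-guided fusion described above, the sketch below warps a historical BEV feature map toward the current one with a predicted flow field, fuses the two, and adds a temporal consistency penalty. It is a minimal PyTorch stand-in, not the authors' MBFNet; the layer sizes, flow-based warping, and loss form are assumptions.

```python
# Simplified sketch of recurrent, motion-guided BEV fusion (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionGuidedBEVFusion(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        # Predicts a 2-channel flow field aligning the historical BEV map with the current one.
        self.motion_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, 3, padding=1),
        )
        # Fuses the aligned history with the current features (recurrent state).
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, bev_curr, bev_hist):
        b, _, h, w = bev_curr.shape
        flow = self.motion_net(torch.cat([bev_hist, bev_curr], dim=1))  # (B, 2, H, W)
        # Build a sampling grid offset by the predicted flow (normalized coordinates).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=bev_curr.device),
            torch.linspace(-1, 1, w, device=bev_curr.device),
            indexing="ij",
        )
        base = torch.stack([xs, ys], dim=-1).expand(b, -1, -1, -1)
        grid = base + flow.permute(0, 2, 3, 1)
        bev_hist_aligned = F.grid_sample(bev_hist, grid, align_corners=True)
        fused = self.fuse(torch.cat([bev_curr, bev_hist_aligned], dim=1))
        # Temporal consistency loss: aligned history should match the current BEV.
        consistency_loss = F.l1_loss(bev_hist_aligned, bev_curr.detach())
        return fused, consistency_loss
```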
https://arxiv.org/abs/2507.08644
Real-world applications of computer vision in the humanities require algorithms to be robust against artistic abstraction, peripheral objects, and subtle differences between fine-grained target classes. Existing datasets provide instance-level annotations on artworks but are generally biased towards the image centre and limited with regard to detailed object classes. The proposed ODOR dataset fills this gap, offering 38,116 object-level annotations across 4712 images, spanning an extensive set of 139 fine-grained categories. Conducting a statistical analysis, we showcase challenging dataset properties, such as a detailed set of categories, dense and overlapping objects, and spatial distribution over the whole image canvas. Furthermore, we provide an extensive baseline analysis for object detection models and highlight the challenging properties of the dataset through a set of secondary studies. Inspiring further research on artwork object detection and broader visual cultural heritage studies, the dataset challenges researchers to explore the intersection of object recognition and smell perception.
https://arxiv.org/abs/2507.08384
This study investigates the potential of a multimodal large language model (LLM), specifically ChatGPT-4o, to perform human-like interpretations of traffic scenes using static dashcam images. Herein, we focus on three judgment tasks relevant to elderly driver assessments: evaluating traffic density, assessing intersection visibility, and recognizing stop signs. These tasks require contextual reasoning rather than simple object detection. Using zero-shot, few-shot, and multi-shot prompting strategies, we evaluated the performance of the model with human annotations serving as the reference standard. Evaluation metrics included precision, recall, and F1-score. Results indicate that prompt design considerably affects performance, with recall for intersection visibility increasing from 21.7% (zero-shot) to 57.0% (multi-shot). For traffic density, agreement increased from 53.5% to 67.6%. In stop-sign detection, the model demonstrated high precision (up to 86.3%) but a lower recall (approximately 76.7%), indicating a conservative response tendency. Output stability analysis revealed that humans and the model faced difficulties interpreting structurally ambiguous scenes. However, the model's explanatory texts corresponded with its predictions, enhancing interpretability. These findings suggest that, with well-designed prompts, LLMs hold promise as supportive tools for scene-level driving risk assessments. Future studies should explore scalability using larger datasets, diverse annotators, and next-generation model architectures for elderly driver assessments.
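For readers unfamiliar with the prompting setups compared above, the snippet below shows what a multi-shot visibility judgment could look like with the OpenAI chat completions API. The prompt wording, label set, example image paths, and decoding settings are illustrative assumptions, not the study's exact protocol.

```python
# Hedged sketch of multi-shot prompting for intersection-visibility judgments.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# Hypothetical labelled examples used as in-context shots.
FEW_SHOT = [
    ("examples/clear_intersection.jpg", "visibility: good"),
    ("examples/occluded_intersection.jpg", "visibility: poor"),
]

def judge_visibility(query_image: str) -> str:
    content = [{"type": "text",
                "text": "Rate intersection visibility as 'good' or 'poor'. Examples:"}]
    for path, label in FEW_SHOT:
        content.append({"type": "image_url", "image_url": {"url": encode_image(path)}})
        content.append({"type": "text", "text": label})
    content.append({"type": "text", "text": "Now rate this scene:"})
    content.append({"type": "image_url", "image_url": {"url": encode_image(query_image)}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        temperature=0,
    )
    return resp.choices[0].message.content
```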
https://arxiv.org/abs/2507.08367
As cyber-physical systems grow increasingly interconnected and spatially distributed, ensuring their resilience against evolving cyberattacks has become a critical priority. Spatio-temporal anomaly detection plays an important role in ensuring system security and operational integrity. However, current data-driven approaches, largely driven by black-box deep learning, face challenges in interpretability, adaptability to distribution shifts, and robustness under evolving system dynamics. In this paper, we advocate a causal learning perspective that grounds detection in structural cause-effect relationships to advance anomaly detection in spatially distributed infrastructures. We identify and formalize three key directions: causal graph profiling, multi-view fusion, and continual causal graph learning, each offering distinct advantages in uncovering dynamic cause-effect structures across time and space. Drawing on real-world insights from systems such as water treatment infrastructures, we illustrate how causal models provide early warning signals and root cause attribution, addressing the limitations of black-box detectors. Looking ahead, we outline a future research agenda centered on multi-modal, generative-AI-driven, and scalable adaptive causal frameworks. Our objective is to lay a new research trajectory toward scalable, adaptive, explainable, and spatially grounded anomaly detection systems. We hope to inspire a paradigm shift in cybersecurity research, promoting causality-driven approaches to address evolving threats in interconnected infrastructures.
https://arxiv.org/abs/2507.08177
Visually impaired people face significant challenges in their day-to-day commutes in the urban cities of Bangladesh due to the vast number of obstructions on every path. With many injuries occurring daily through road accidents, it is paramount to develop a system that can warn the visually impaired of nearby objects in advance. To overcome this issue, this research proposes a novel alert system to assist the visually impaired in commuting through busy streets without colliding with objects. The proposed system alerts the individual to objects present at close distance. It utilizes transfer learning to train models for depth estimation and object detection, and combines both models into a novel system. The models are optimized through quantization techniques to make them lightweight and efficient, allowing them to be easily deployed on embedded systems. The proposed solution achieved a lightweight, real-time depth estimation and object detection model with an mAP50 of 0.801.
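The core alert logic, combining a detector's boxes with a dense depth map, can be sketched in a few lines. The distance threshold, box format, and use of the median depth are assumptions for illustration; the paper's quantized models are not reproduced here.

```python
# Minimal sketch: fuse per-frame detections with a dense depth map and warn
# when an object is closer than a threshold. All parameters are assumed.
import numpy as np

ALERT_DISTANCE_M = 2.0  # assumed proximity threshold in metres

def proximity_alerts(depth_map: np.ndarray, detections):
    """detections: iterable of (x1, y1, x2, y2, label) in pixel coordinates."""
    alerts = []
    for x1, y1, x2, y2, label in detections:
        patch = depth_map[int(y1):int(y2), int(x1):int(x2)]
        if patch.size == 0:
            continue
        # Median depth is more robust to estimation noise than the minimum.
        distance = float(np.median(patch))
        if distance < ALERT_DISTANCE_M:
            alerts.append((label, distance))
    return alerts

# Example: a bench 1.4 m ahead triggers a warning, a distant car does not.
depth = np.full((480, 640), 10.0)
depth[200:400, 250:400] = 1.4
print(proximity_alerts(depth, [(250, 200, 400, 400, "bench"),
                               (10, 10, 60, 60, "car")]))
```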
https://arxiv.org/abs/2507.08165
Image sensors are integral to a wide range of safety- and security-critical systems, including surveillance infrastructure, autonomous vehicles, and industrial automation. These systems rely on the integrity of visual data to make decisions. In this work, we investigate a novel class of electromagnetic signal injection attacks that target the analog domain of image sensors, allowing adversaries to manipulate raw visual inputs without triggering conventional digital integrity checks. We uncover a previously undocumented attack phenomenon on CMOS image sensors: rainbow-like color artifacts induced in captured images through carefully tuned electromagnetic interference. We further evaluate the impact of these attacks on state-of-the-art object detection models, showing that the injected artifacts propagate through the image signal processing pipeline and lead to significant mispredictions. Our findings highlight a critical and underexplored vulnerability in the visual perception stack, underscoring the need for more robust defenses against physical-layer attacks in such systems.
https://arxiv.org/abs/2507.07773
Phonetic Cloaking Replacement (PCR), defined as the deliberate use of homophonic or near-homophonic variants to hide toxic intent, has become a major obstacle to Chinese content moderation. While this problem is well-recognized, existing evaluations predominantly rely on rule-based, synthetic perturbations that ignore the creativity of real users. We organize PCR into a four-way surface-form taxonomy and compile \ours, a dataset of 500 naturally occurring, phonetically cloaked offensive posts gathered from the RedNote platform. Benchmarking state-of-the-art LLMs on this dataset exposes a serious weakness: the best model reaches only an F1-score of 0.672, and zero-shot chain-of-thought prompting pushes performance even lower. Guided by error analysis, we revisit a Pinyin-based prompting strategy that earlier studies judged ineffective and show that it recovers much of the lost accuracy. This study offers the first comprehensive taxonomy of Chinese PCR, a realistic benchmark that reveals current detectors' limits, and a lightweight mitigation technique that advances research on robust toxicity detection.
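The Pinyin-based prompting strategy revisited above can be sketched as follows: romanize the post with pypinyin and present both forms to the model so homophone substitutions become recoverable. The prompt wording and the example post are illustrative, not drawn from the dataset.

```python
# Sketch of a Pinyin-augmented moderation prompt (illustrative wording).
from pypinyin import lazy_pinyin  # pip install pypinyin

def build_prompt(post: str) -> str:
    pinyin = " ".join(lazy_pinyin(post))  # toneless romanization of the post
    return (
        "You are a content moderator. The post may hide toxic words behind "
        "homophones; use the Pinyin to recover the intended reading.\n"
        f"Post: {post}\n"
        f"Pinyin: {pinyin}\n"
        "Answer 'offensive' or 'not offensive'."
    )

# Hypothetical, benign example post used only to show the prompt format.
print(build_prompt("这家店的奶茶超好喝"))
```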
https://arxiv.org/abs/2507.07640
This paper presents a competitive approach to multilingual subjectivity detection using large language models (LLMs) with few-shot prompting. We participated in Task 1: Subjectivity of the CheckThat! 2025 evaluation campaign. We show that LLMs, when paired with carefully designed prompts, can match or outperform fine-tuned smaller language models (SLMs), particularly in noisy or low-quality data settings. Despite experimenting with advanced prompt engineering techniques, such as debating LLMs and various example selection strategies, we found limited benefit beyond well-crafted standard few-shot prompts. Our system achieved top rankings across multiple languages in the CheckThat! 2025 subjectivity detection task, including first place in Arabic and Polish, and top-four finishes in Italian, English, German, and multilingual tracks. Notably, our method proved especially robust on the Arabic dataset, likely due to its resilience to annotation inconsistencies. These findings highlight the effectiveness and adaptability of LLM-based few-shot learning for multilingual sentiment tasks, offering a strong alternative to traditional fine-tuning, particularly when labeled data is scarce or inconsistent.
https://arxiv.org/abs/2507.07539
Phishing attacks are becoming increasingly sophisticated, underscoring the need for detection systems that strike a balance between high accuracy and computational efficiency. This paper presents a comparative evaluation of traditional Machine Learning (ML), Deep Learning (DL), and quantized small-parameter Large Language Models (LLMs) for phishing detection. Through experiments on a curated dataset, we show that while LLMs currently underperform compared to ML and DL methods in terms of raw accuracy, they exhibit strong potential for identifying subtle, context-based phishing cues. We also investigate the impact of zero-shot and few-shot prompting strategies, revealing that LLM-rephrased emails can significantly degrade the performance of both ML and LLM-based detectors. Our benchmarking highlights that models like DeepSeek R1 Distill Qwen 14B (Q8_0) achieve competitive accuracy, above 80%, using only 17GB of VRAM, supporting their viability for cost-efficient deployment. We further assess the models' adversarial robustness and cost-performance tradeoffs, and demonstrate how lightweight LLMs can provide concise, interpretable explanations to support real-time decision-making. These findings position optimized LLMs as promising components in phishing defence systems and offer a path forward for integrating explainable, efficient AI into modern cybersecurity frameworks.
https://arxiv.org/abs/2507.07406
Visual effects (VFX) production often struggles with slow, resource-intensive mask generation. This paper presents an automated video segmentation pipeline that creates temporally consistent instance masks. It employs machine learning for: (1) flexible object detection via text prompts, (2) refined per-frame image segmentation and (3) robust video tracking to ensure temporal stability. Deployed using containerization and leveraging a structured output format, the pipeline was quickly adopted by our artists. It significantly reduces manual effort, speeds up the creation of preliminary composites, and provides comprehensive segmentation data, thereby enhancing overall VFX production efficiency.
https://arxiv.org/abs/2507.07242
While automated vehicles hold the potential to significantly reduce traffic accidents, their perception systems remain vulnerable to sensor degradation caused by adverse weather and environmental occlusions. Collective perception, which enables vehicles to share information, offers a promising approach to overcoming these limitations. However, to date, collective perception in adverse weather is largely unstudied. We therefore conduct the first study of LiDAR-based collective perception under diverse weather conditions. Adverse weather not only degrades perception capabilities but also increases bandwidth requirements and latency, because the introduced noise is transmitted and processed as well; denoising prior to communication can effectively mitigate these issues. We therefore propose DenoiseCP-Net, a novel multi-task architecture for LiDAR-based collective perception under adverse weather conditions. DenoiseCP-Net integrates voxel-level noise filtering and object detection into a unified sparse convolution backbone, eliminating the redundant computation associated with two-stage pipelines. This design not only reduces inference latency and computational cost but also minimizes communication overhead by removing non-informative noise. We extend the well-known OPV2V dataset by simulating rain, snow, and fog using our realistic weather simulation models. We demonstrate that DenoiseCP-Net achieves near-perfect denoising accuracy in adverse weather, reduces bandwidth requirements by up to 23.6% while maintaining the same detection accuracy, and lowers inference latency for cooperative vehicles.
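As a rough, dense stand-in for the unified multi-task design described above (DenoiseCP-Net itself operates on voxels with sparse convolutions), the sketch below shares one backbone between a per-cell noise head and a detection head and suppresses features flagged as noise before detection. Channel counts, the 0.5 noise threshold, and the 7-parameter box encoding are assumptions.

```python
# Simplified, dense illustration of a shared-backbone denoise + detect design.
import torch
import torch.nn as nn

class MultiTaskPerception(nn.Module):
    def __init__(self, in_ch=16, feat=64, num_anchors=2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.noise_head = nn.Conv2d(feat, 1, 1)               # per-cell noise logit
        self.det_head = nn.Conv2d(feat, num_anchors * 7, 1)   # (x, y, z, w, l, h, yaw) per anchor

    def forward(self, bev):
        f = self.backbone(bev)
        noise_logits = self.noise_head(f)
        # Suppress features flagged as noise before detection / transmission.
        keep = torch.sigmoid(noise_logits) < 0.5
        return self.det_head(f * keep), noise_logits

boxes, noise = MultiTaskPerception()(torch.rand(1, 16, 128, 128))
print(boxes.shape, noise.shape)
```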
https://arxiv.org/abs/2507.06976
Insects comprise millions of species, many experiencing severe population declines under environmental and habitat changes. High-throughput approaches are crucial for accelerating our understanding of insect diversity, with DNA barcoding and high-resolution imaging showing strong potential for automatic taxonomic classification. However, most image-based approaches rely on individual specimen data, unlike the unsorted bulk samples collected in large-scale ecological surveys. We present the Mixed Arthropod Sample Segmentation and Identification (MassID45) dataset for training automatic classifiers of bulk insect samples. It uniquely combines molecular and imaging data at both the unsorted sample level and the full set of individual specimens. Human annotators, supported by an AI-assisted tool, performed two tasks on bulk images: creating segmentation masks around each individual arthropod and assigning taxonomic labels to over 17 000 specimens. Combining the taxonomic resolution of DNA barcodes with precise abundance estimates of bulk images holds great potential for rapid, large-scale characterization of insect communities. This dataset pushes the boundaries of tiny object detection and instance segmentation, fostering innovation in both ecological and machine learning research.
https://arxiv.org/abs/2507.06972
Thermal imaging from unmanned aerial vehicles (UAVs) holds significant potential for applications in search and rescue, wildlife monitoring, and emergency response, especially under low-light or obscured conditions. However, the scarcity of large-scale, diverse thermal aerial datasets limits the advancement of deep learning models in this domain, primarily due to the high cost and logistical challenges of collecting thermal data. In this work, we introduce a novel procedural pipeline for generating synthetic thermal images from an aerial perspective. Our method integrates arbitrary object classes into existing thermal backgrounds by providing control over the position, scale, and orientation of the new objects, while aligning them with the viewpoints of the background. We enhance existing thermal datasets by introducing new object categories, specifically adding a drone class in urban environments to the HIT-UAV dataset and an animal category to the MONET dataset. Evaluating these datasets on the object detection task, we showcase strong performance across both new and existing classes, validating the successful expansion into new applications. Through comparative analysis, we show that thermal detectors outperform their visible-light-trained counterparts and highlight the importance of replicating aerial viewing angles. Project page: this https URL.
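A minimal version of the compositing step, pasting a masked object crop into a thermal background with explicit position, scale, and orientation control, might look like the following. Radiometric blending, viewpoint alignment, and bounds checking are omitted; all parameters are illustrative.

```python
# Illustrative object-into-background compositing for synthetic thermal frames.
import cv2
import numpy as np

def composite(background: np.ndarray, obj: np.ndarray, mask: np.ndarray,
              center_xy: tuple, scale: float, angle_deg: float) -> np.ndarray:
    # Resize and rotate the object crop and its binary mask together.
    obj = cv2.resize(obj, None, fx=scale, fy=scale)
    mask = cv2.resize(mask, None, fx=scale, fy=scale, interpolation=cv2.INTER_NEAREST)
    h, w = obj.shape[:2]
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    obj = cv2.warpAffine(obj, rot, (w, h))
    mask = cv2.warpAffine(mask, rot, (w, h), flags=cv2.INTER_NEAREST)

    out = background.copy()
    cx, cy = center_xy
    x0, y0 = int(cx - w / 2), int(cy - h / 2)
    # No bounds checking for brevity: the placed crop must fit inside the frame.
    roi = out[y0:y0 + h, x0:x0 + w]
    roi[mask > 0] = obj[mask > 0]   # hard paste; real pipelines blend intensities
    out[y0:y0 + h, x0:x0 + w] = roi
    return out
```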
https://arxiv.org/abs/2507.06797
Autonomous maritime surveillance and target vessel identification in environments where Global Navigation Satellite Systems (GNSS) are not available is critical for a number of applications such as search and rescue and threat detection. When the target vessel is only described by visual cues and its last known position is not available, unmanned aerial vehicles (UAVs) must rely solely on on-board vision to scan a large search area under strict computational constraints. To address this challenge, we leverage the YOLOv8 object detection model to detect all vessels in the field of view. We then apply feature matching and hue histogram distance analysis to determine whether any detected vessel corresponds to the target. When found, we localize the target using simple geometric principles. We demonstrate the proposed method in real-world experiments during the MBZIRC2023 competition, integrated into a fully autonomous system with GNSS-denied navigation. We also evaluate the impact of perspective on detection accuracy and localization precision and compare it with the oracle approach.
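The hue-histogram matching used to decide whether a detected vessel fits the target description can be sketched with OpenCV as below; the bin count and Bhattacharyya threshold are assumptions, and the crops are presumed to come from the upstream YOLOv8 detections.

```python
# Sketch of hue-histogram distance matching between vessel crops.
import cv2
import numpy as np

def hue_histogram(bgr_crop: np.ndarray, bins: int = 32) -> np.ndarray:
    hsv = cv2.cvtColor(bgr_crop, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180])  # hue channel only
    return cv2.normalize(hist, hist, norm_type=cv2.NORM_L1)

def is_target(candidate_crop: np.ndarray, reference_crop: np.ndarray,
              threshold: float = 0.35) -> bool:
    # Bhattacharyya distance: 0 means identical histograms, 1 means disjoint.
    d = cv2.compareHist(hue_histogram(reference_crop),
                        hue_histogram(candidate_crop),
                        cv2.HISTCMP_BHATTACHARYYA)
    return d < threshold
```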
https://arxiv.org/abs/2507.07153
Object navigation in open-world environments remains a formidable and pervasive challenge for robotic systems, particularly when it comes to executing long-horizon tasks that require both open-world object detection and high-level task planning. Traditional methods often struggle to integrate these components effectively, and this limits their capability to deal with complex, long-range navigation missions. In this paper, we propose LOVON, a novel framework that integrates large language models (LLMs) for hierarchical task planning with open-vocabulary visual detection models, tailored for effective long-range object navigation in dynamic, unstructured environments. To tackle real-world challenges including visual jittering, blind zones, and temporary target loss, we design dedicated solutions such as Laplacian Variance Filtering for visual stabilization. We also develop a functional execution logic for the robot that guarantees LOVON's capabilities in autonomous navigation, task adaptation, and robust task completion. Extensive evaluations demonstrate the successful completion of long-sequence tasks involving real-time detection, search, and navigation toward open-vocabulary dynamic targets. Furthermore, real-world experiments across different legged robots (Unitree Go2, B2, and H1-2) showcase the compatibility and appealing plug-and-play feature of LOVON.
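Laplacian Variance Filtering, mentioned above for visual stabilization, reduces to a standard focus measure: frames whose Laplacian variance falls below a threshold are treated as jittered or blurred and skipped before detection. The threshold below is an assumed value.

```python
# Minimal Laplacian-variance blur filter for incoming frames.
import cv2

BLUR_THRESHOLD = 100.0  # assumed; tune per camera and resolution

def is_usable(frame_bgr) -> bool:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    focus_measure = cv2.Laplacian(gray, cv2.CV_64F).var()
    return focus_measure >= BLUR_THRESHOLD
```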
https://arxiv.org/abs/2507.06747
Smart data selection is becoming increasingly important in data-driven machine learning. Active learning offers a promising solution by allowing machine learning models to be effectively trained with optimal data, including the most informative samples from large datasets. Wildlife data captured by camera traps are excessive in volume, requiring tremendous effort in data labelling and in training animal detection models. Therefore, applying active learning to optimise the amount of labelled data would greatly aid automated wildlife monitoring and conservation. However, existing active learning techniques require that a machine learning model (i.e., an object detector) be fully accessible, limiting their applicability. In this paper, we propose a model-agnostic active learning approach for detecting animals captured by camera traps. Our approach integrates uncertainty and diversity quantities of samples at both the object-based and image-based levels into the active learning sample selection process. We validate our approach on a benchmark animal dataset. Experimental results demonstrate that, using only 30% of the training data selected by our approach, a state-of-the-art animal detector can achieve performance equal to or greater than that obtained with the complete training dataset.
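A simple, model-agnostic selection loop in the spirit of the approach above might score each unlabelled image by detection uncertainty and then pick a diverse batch by greedy farthest-point selection over image embeddings. The paper's exact object- and image-level formulation differs; this is only an illustrative sketch.

```python
# Illustrative uncertainty + diversity sample selection for active learning.
import numpy as np

def image_uncertainty(det_confidences: list[float]) -> float:
    # Images whose detections are least confident are most informative.
    if not det_confidences:
        return 1.0  # nothing detected: treat as maximally uncertain
    return float(np.mean([1.0 - c for c in det_confidences]))

def select_batch(embeddings: np.ndarray, uncertainties: np.ndarray, k: int):
    # Start from the most uncertain image, then greedily add images that are
    # both far from the current selection in embedding space and uncertain.
    selected = [int(np.argmax(uncertainties))]
    for _ in range(k - 1):
        dists = np.min(
            np.linalg.norm(embeddings[:, None] - embeddings[selected][None], axis=-1),
            axis=1)
        score = dists * uncertainties
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
    return selected

# Toy usage with random embeddings and per-image uncertainty scores.
emb = np.random.rand(100, 64)
unc = np.random.rand(100)
print(select_batch(emb, unc, k=10))
```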
https://arxiv.org/abs/2507.06537
Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image, even those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which contradicts the nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based Supervision Guidance (LSG) component, which provides fine-grained token-level supervision for the HOI detector via the LLM component of the VLM. LSG enhances the ability of ABG to generate high-quality attention bias. We conduct extensive experiments on two popular benchmarks, HICO-DET and V-COCO, consistently achieving superior performance in both the open-vocabulary and closed settings. The code will be released on GitHub.
https://arxiv.org/abs/2507.06510
High-speed vision sensing is essential for real-time perception in applications such as robotics, autonomous vehicles, and industrial automation. Traditional frame-based vision systems suffer from motion blur, high latency, and redundant data processing, limiting their performance in dynamic environments. Event cameras, which capture asynchronous brightness changes at the pixel level, offer a promising alternative but pose challenges in object detection due to sparse and noisy event streams. To address this, we propose an event autoencoder architecture that efficiently compresses and reconstructs event data while preserving critical spatial and temporal features. The proposed model employs convolutional encoding and incorporates adaptive threshold selection and a lightweight classifier to enhance recognition accuracy while reducing computational complexity. Experimental results on the existing Smart Event Face Dataset (SEFD) demonstrate that our approach achieves comparable accuracy to the YOLO-v4 model while utilizing up to 35.5× fewer parameters. Implementations on embedded platforms, including Raspberry Pi 4B and NVIDIA Jetson Nano, show high frame rates ranging from 8 FPS up to 44.8 FPS. The proposed classifier exhibits up to 87.84× better FPS than the state-of-the-art and significantly improves event-based vision performance, making it ideal for low-power, high-speed applications in real-time edge computing.
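A compact convolutional event autoencoder with a lightweight classifier head, in the spirit of the architecture above, could be laid out as follows; channel counts, input resolution, and class count are illustrative assumptions.

```python
# Sketch of a small convolutional event autoencoder with a classifier head.
import torch
import torch.nn as nn

class EventAutoencoder(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(                      # compress the event frame
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(                      # reconstruct the event frame
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        self.classifier = nn.Sequential(                   # lightweight recognition head
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

x = torch.rand(1, 1, 64, 64)        # accumulated event frame
recon, logits = EventAutoencoder()(x)
print(recon.shape, logits.shape)     # (1, 1, 64, 64) and (1, 2)
```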
https://arxiv.org/abs/2507.06459