Text-Based Person Search (TBPS) holds unique value in real-world surveillance, bridging visual perception and language understanding, yet current paradigms built on pre-trained models often fail to transfer effectively to complex open-world scenarios. The reliance on "Passive Observation" leads to multifaceted spurious correlations and spatial semantic misalignment, causing a lack of robustness against distribution shifts. To fundamentally resolve these defects, this paper proposes ICON (Invariant Counterfactual Optimization with Neuro-symbolic priors), a framework integrating causal and topological priors. First, we introduce Rule-Guided Spatial Intervention to strictly penalize sensitivity to bounding-box noise, forcibly severing location shortcuts to achieve geometric invariance. Second, Counterfactual Context Disentanglement is implemented via semantic-driven background transplantation, compelling the model to ignore background interference for environmental independence. Third, we employ Saliency-Driven Semantic Regularization with adaptive masking to resolve local saliency bias and guarantee holistic completeness. Finally, Neuro-Symbolic Topological Alignment uses neuro-symbolic priors to constrain feature matching, ensuring activated regions are topologically consistent with human structural logic. Experimental results demonstrate that ICON not only maintains leading performance on standard benchmarks but also exhibits exceptional robustness against occlusion, background interference, and localization noise. This approach effectively advances the field by shifting from fitting statistical co-occurrences to learning causal invariance.
https://arxiv.org/abs/2601.15931
Accurate prediction of traffic crash severity is critical for improving emergency response and public safety planning. Although recent large language models (LLMs) exhibit strong reasoning capabilities, their single-agent architectures often struggle with heterogeneous, domain-specific crash data and tend to generate biased or unstable predictions. To address these limitations, this paper proposes TransportAgents, a hybrid multi-agent framework that integrates category-specific LLM reasoning with a multilayer perceptron (MLP) integration module. Each specialized agent focuses on a particular subset of traffic information, such as demographics, environmental context, or incident details, to produce intermediate severity assessments that are subsequently fused into a unified prediction. Extensive experiments on two complementary U.S. datasets, the Consumer Product Safety Risk Management System (CPSRMS) and the National Electronic Injury Surveillance System (NEISS), demonstrate that TransportAgents consistently outperforms both traditional machine learning and advanced LLM-based baselines. Across three representative backbones, including closed-source models such as GPT-3.5 and GPT-4o, as well as open-source models such as LLaMA-3.3, the framework exhibits strong robustness, scalability, and cross-dataset generalizability. A supplementary distributional analysis further shows that TransportAgents produces more balanced and well-calibrated severity predictions than standard single-agent LLM approaches, highlighting its interpretability and reliability for safety-critical decision support applications.
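To make the fusion step concrete, here is a minimal sketch of how per-agent severity assessments could be combined by an MLP integration module. The agent roles, dimensions, random weights, and the function `fuse_agent_scores` are illustrative assumptions, not the paper's trained component:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_agent_scores(agent_scores, W1, b1, W2, b2):
    """Concatenate per-agent severity scores and pass them through a
    one-hidden-layer MLP to obtain a unified severity distribution."""
    x = np.concatenate(agent_scores)   # e.g. 3 agents x 4 levels -> 12-dim
    h = np.tanh(W1 @ x + b1)           # hidden layer
    return softmax(W2 @ h + b2)        # probabilities over severity levels

# Three hypothetical agents (demographics, environment, incident details),
# each emitting scores over four severity levels.
agent_scores = [rng.random(4) for _ in range(3)]
W1, b1 = rng.standard_normal((8, 12)), np.zeros(8)
W2, b2 = rng.standard_normal((4, 8)), np.zeros(4)

probs = fuse_agent_scores(agent_scores, W1, b1, W2, b2)
print(probs)  # four probabilities summing to 1
```

In the paper the integration module would be trained on the agents' intermediate outputs; the weights here are random purely to show the data flow.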
https://arxiv.org/abs/2601.15519
Object detection in video and image surveillance is a well-established yet rapidly evolving task, strongly influenced by recent deep learning advancements. This review summarises modern techniques by examining architectural innovations, generative model integration, and the use of temporal information to enhance robustness and accuracy. Unlike earlier surveys, it classifies methods based on core architectures, data processing strategies, and surveillance-specific challenges such as dynamic environments, occlusions, lighting variations, and real-time requirements. The primary goal is to evaluate the current effectiveness of semantic object detection, while secondary aims include analysing deep learning models and their practical applications. The review covers CNN-based detectors, GAN-assisted approaches, and temporal fusion methods, highlighting how generative models support tasks such as reconstructing missing frames, reducing occlusions, and normalising illumination. It also outlines preprocessing pipelines, feature extraction progress, benchmarking datasets, and comparative evaluations. Finally, emerging trends in low-latency, efficient, and spatiotemporal learning approaches are identified for future research.
https://arxiv.org/abs/2601.14677
Human-centric visual analysis plays a pivotal role in diverse applications, including surveillance, healthcare, and human-computer interaction. With the emergence of large-scale unlabeled human image datasets, there is an increasing need for a general unsupervised pre-training model capable of supporting diverse human-centric downstream tasks. To achieve this goal, we propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework designed for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. These multi-level semantic cues are then integrated into the learned visual representations, enriching their expressiveness and generalizability. Recognizing that different downstream tasks demand varying levels of semantic granularity, CLASP incorporates a Prompt-Controlled Mixture-of-Experts (MoE) module. MoE dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability. Furthermore, CLASP employs a multi-task pre-training strategy, where part- and attribute-level pseudo-labels derived from CLIP guide the representation learning process. Extensive experiments across multiple benchmarks demonstrate that CLASP consistently outperforms existing unsupervised pre-training methods, advancing the field of human-centric visual analysis.
https://arxiv.org/abs/2601.13133
Detection of oil spills from satellite images is essential for both environmental surveillance and maritime safety. Traditional threshold-based methods frequently suffer performance degradation due to very high false-alarm rates caused by look-alike phenomena such as wind slicks and ship wakes. Here, a hybrid deep learning model, DeepSegFusion, is presented for oil spill segmentation in Synthetic Aperture Radar (SAR) images. The model integrates SegNet and DeepLabV3+ with an attention-based feature fusion mechanism to achieve better boundary precision as well as improved contextual understanding. Results obtained on SAR oil spill datasets, including ALOS PALSAR imagery, confirm that the proposed DeepSegFusion model achieves an accuracy of 94.85%, an Intersection over Union (IoU) of 0.5685, and a ROC-AUC score of 0.9330. The proposed method reduces false detections by 64.4% relative to individual baseline models and traditional non-segmentation methods, nearly a threefold reduction. These results indicate that DeepSegFusion is stable under various marine conditions and can therefore be used in near real-time oil spill monitoring scenarios.
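For reference, the Intersection over Union figure quoted above is the standard overlap measure between predicted and ground-truth spill masks. A minimal sketch of its usual computation on binary masks (illustrative, not the paper's code):

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union for binary masks (1 = oil spill pixel)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # two empty masks agree perfectly
    return np.logical_and(pred, gt).sum() / union

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt   = np.array([[1, 0, 0],
                 [0, 1, 1]])
print(iou(pred, gt))  # intersection 2, union 4 -> 0.5
```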
https://arxiv.org/abs/2601.12015
Intelligent surveillance systems often handle perceptual tasks such as object detection, facial recognition, and emotion analysis independently, but they lack a unified, adaptive runtime scheduler that dynamically allocates computational resources based on contextual triggers. This limits their holistic understanding and efficiency on low-power edge devices. To address this, we present a real-time multi-modal vision framework that integrates object detection, owner-specific face recognition, and emotion detection into a unified pipeline deployed on a Raspberry Pi 5 edge platform. The core of our system is an adaptive scheduling mechanism that reduces computational load by 65% compared to continuous processing by selectively activating modules such as YOLOv8n for object detection, a custom FaceNet-based embedding system for facial recognition, and DeepFace's CNN for emotion classification. Experimental results demonstrate the system's efficacy, with the object detection module achieving an Average Precision (AP) of 0.861, facial recognition attaining 88% accuracy, and emotion detection showing strong discriminatory power (AUC up to 0.97 for specific emotions), while operating at 5.6 frames per second. Our work demonstrates that context-aware scheduling is the key to unlocking complex multi-modal AI on cost-effective edge hardware, making intelligent perception more accessible and privacy-preserving.
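The gating idea behind such an adaptive scheduler can be sketched as an event-driven cascade in which heavier modules run only when a cheaper upstream result triggers them. The module names and trigger rules below are simplified assumptions, not the deployed pipeline:

```python
def schedule(frame_events):
    """Toy event-driven scheduler: each heavier module is invoked only
    when a cheaper upstream module produces a triggering result."""
    invocations = []
    for ev in frame_events:
        invocations.append("detect")             # detector runs every frame
        if "person" in ev:                       # face module gated on a person
            invocations.append("face")
            if ev.get("person") == "owner":      # emotion gated on a known face
                invocations.append("emotion")
    return invocations

frames = [{}, {"person": "unknown"}, {"person": "owner"}, {}]
calls = schedule(frames)
print(calls)
baseline = 3 * len(frames)           # all three modules on every frame
print(1 - len(calls) / baseline)     # fraction of module invocations saved
```

On this toy trace the cascade skips 5 of 12 module invocations; the 65% figure reported above would depend on how often triggers fire in real footage.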
https://arxiv.org/abs/2601.11970
The rapid proliferation of airborne platforms, including commercial aircraft and unmanned aerial vehicles (UAVs) such as consumer drones, has intensified the need for real-time, automated threat assessment systems. Current approaches depend heavily on manual monitoring, resulting in limited scalability and operational inefficiencies. This work introduces a dual-task model based on EfficientNetB4 capable of performing airborne object classification and threat-level prediction simultaneously. To address the scarcity of clean, balanced training data, we constructed the AODTA Dataset by aggregating and refining multiple public sources. We benchmarked our approach on both the AVD Dataset and the newly developed AODTA Dataset and further compared performance against a ResNet-50 baseline, which consistently underperformed EfficientNetB4. Our EfficientNetB4 model achieved 96% accuracy in object classification and 90% accuracy in threat-level prediction, underscoring its promise for applications in surveillance, defense, and airspace management. Although the title references detection, this study focuses specifically on classification and threat-level inference using pre-localized airborne object images provided by existing datasets.
https://arxiv.org/abs/2601.11907
Recent advances in video anomaly detection (VAD) mainly focus on ground-based surveillance or unmanned aerial vehicle (UAV) videos with static backgrounds, whereas research on UAV videos with dynamic backgrounds remains limited. Unlike static scenarios, dynamically captured UAV videos exhibit multi-source motion coupling, where the motion of objects and UAV-induced global motion are intricately intertwined. Consequently, existing methods may misclassify normal UAV movements as anomalies or fail to capture true anomalies concealed within dynamic backgrounds. Moreover, many approaches do not adequately address the joint modeling of inter-frame continuity and local spatial correlations across diverse temporal scales. To overcome these limitations, we propose the Frequency-Assisted Temporal Dilation Mamba (FTDMamba) network for UAV VAD, including two core components: (1) a Frequency Decoupled Spatiotemporal Correlation Module, which disentangles coupled motion patterns and models global spatiotemporal dependencies through frequency analysis; and (2) a Temporal Dilation Mamba Module, which leverages Mamba's sequence modeling capability to jointly learn fine-grained temporal dynamics and local spatial structures across multiple temporal receptive fields. Additionally, unlike existing UAV VAD datasets which focus on static backgrounds, we construct a large-scale Moving UAV VAD dataset (MUVAD), comprising 222,736 frames with 240 anomaly events across 12 anomaly types. Extensive experiments demonstrate that FTDMamba achieves state-of-the-art (SOTA) performance on two public static benchmarks and the new MUVAD dataset. The code and MUVAD dataset will be available at: this https URL.
https://arxiv.org/abs/2601.11254
Detecting vulnerable road users (VRUs), particularly children and adolescents, in low-light and adverse weather conditions remains a critical challenge in computer vision, surveillance, and autonomous vehicle systems. This paper presents a purpose-built lightweight object detection model designed to identify young pedestrians in various environmental scenarios. To address these challenges, our approach leverages thermal imaging from long-wave infrared (LWIR) cameras, which enhances detection reliability in conditions where traditional RGB cameras operating in the visible spectrum fail. Based on the YOLO11 architecture and customized for thermal detection, our model, termed LTV-YOLO (Lightweight Thermal Vision YOLO), is optimized for computational efficiency, accuracy, and real-time performance on edge devices. By integrating depthwise separable convolutions and a feature pyramid network (FPN), LTV-YOLO achieves strong performance in detecting small-scale, partially occluded, and thermally distinct VRUs while maintaining a compact architecture. This work contributes a practical and scalable solution to improve pedestrian safety in intelligent transportation systems, particularly in school zones, autonomous navigation, and smart city infrastructure. Unlike prior thermal detectors, our contribution is task-specific: a thermal-only, edge-capable design for young and small VRUs (children and distant adults). Although FPN and depthwise separable convolutions are standard components, their integration into a thermal-only pipeline optimized for small or occluded VRUs under adverse conditions is, to the best of our knowledge, novel.
https://arxiv.org/abs/2601.11662
Civil aviation is a cornerstone of global transportation and commerce, and ensuring its safety, efficiency, and customer satisfaction is paramount. Yet conventional Artificial Intelligence (AI) solutions in aviation remain siloed and narrow, focusing on isolated tasks or single modalities. They struggle to integrate heterogeneous data such as voice communications, radar tracks, sensor streams, and textual reports, which limits situational awareness, adaptability, and real-time decision support. This paper introduces the vision of AviationLMM, a Large Multimodal foundation Model for civil aviation, designed to unify the heterogeneous data streams of civil aviation and enable understanding, reasoning, generation, and agentic applications. We first identify the gaps between existing AI solutions and these requirements. Second, we describe a model architecture that ingests multimodal inputs such as air-ground voice, surveillance, on-board telemetry, video, and structured text, performs cross-modal alignment and fusion, and produces flexible outputs ranging from situation summaries and risk alerts to predictive diagnostics and multimodal incident reconstructions. To fully realize this vision, we identify key research opportunities, including data acquisition, alignment and fusion, pretraining, reasoning, trustworthiness, privacy, robustness to missing modalities, and synthetic scenario generation. By articulating the design and challenges of AviationLMM, we aim to accelerate progress on civil aviation foundation models and catalyze coordinated research efforts toward an integrated, trustworthy, and privacy-preserving aviation AI ecosystem.
https://arxiv.org/abs/2601.09105
The Internet of Underwater Things (IoUT) is attracting increasing attention for monitoring sea life and the deep-ocean environment, underwater surveillance, and maintenance of underwater installations. However, conventional IoUT devices, reliant on battery power, face limitations in lifespan and pose environmental hazards upon disposal. This paper introduces a sustainable approach for simultaneous information uplink from IoUT devices and acoustic energy transfer (AET) to the devices via an autonomous underwater vehicle (AUV), potentially enabling them to operate indefinitely. To capture time sensitivity and fairness, we adopt the age of information (AoI) and Jain's fairness index as metrics. We develop two deep reinforcement learning (DRL) algorithms: a high-complexity, high-performance frequency division duplex (FDD) solution and a low-complexity, medium-performance time division duplex (TDD) approach. The results show that the proposed FDD and TDD solutions significantly reduce the average AoI and boost both the harvested energy and data-collection fairness compared to baseline approaches.
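The two metrics named above are standard and compact enough to state directly. A minimal sketch, assuming a unit-step time grid and an age that resets to zero at each delivery (real AoI formulations usually reset to the packet's delay instead):

```python
def jain_fairness(x):
    """Jain's index: (sum x)^2 / (n * sum x^2); 1.0 means perfectly fair."""
    n = len(x)
    return sum(x) ** 2 / (n * sum(v * v for v in x))

def average_aoi(deliveries, horizon):
    """Mean age of information over a discrete horizon: the age grows by
    one per step and resets to zero at each delivery instant."""
    ages, age, deliveries = [], 0, set(deliveries)
    for t in range(horizon):
        age = 0 if t in deliveries else age + 1
        ages.append(age)
    return sum(ages) / horizon

print(jain_fairness([5, 5, 5, 5]))   # equal shares -> 1.0
print(jain_fairness([10, 0, 0, 0]))  # one device hoards -> 0.25
print(average_aoi([2, 5], 8))
```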
https://arxiv.org/abs/2601.08491
The rapid deployment of drones poses significant challenges for airspace management, security, and surveillance. Current detection and classification technologies, including cameras, LiDAR, and conventional radar systems, often struggle to reliably identify and differentiate drones, especially those of similar models, under diverse environmental conditions and at extended ranges. Moreover, low radar cross sections and clutter further complicate accurate drone identification. To address these limitations, we propose a novel drone classification method based on artificial micro-Doppler signatures encoded by resonant electromagnetic stickers attached to drone blades. These tags generate distinctive, configuration-specific radar returns, enabling robust identification. We develop a tailored convolutional neural network (CNN) capable of processing raw radar signals, achieving high classification accuracy. Extensive experiments were conducted both in anechoic chambers with 43 tag configurations and outdoors under realistic flight trajectories and noise conditions. Dimensionality reduction techniques, including Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP), provided insight into code separability and robustness. Our results demonstrate reliable drone classification performance at signal-to-noise ratios as low as 7 dB, indicating the feasibility of long-range detection with advanced surveillance radar systems. Preliminary range estimations indicate potential operational distances of several kilometers, suitable for critical applications such as airport airspace monitoring. The integration of electromagnetic tagging with machine learning enables scalable and efficient drone identification, paving the way for enhanced aerial traffic management and security in increasingly congested airspaces.
https://arxiv.org/abs/2601.08042
Video Question Answering (VideoQA) models enhance understanding and interaction with audiovisual content, making it more accessible, searchable, and useful for a wide range of fields such as education, surveillance, entertainment, and content creation. Due to heavy compute requirements, most large visual language models (VLMs) for VideoQA rely on a fixed number of frames obtained by uniformly sampling the video. However, this process does not pick important frames or capture the context of the video. We present a novel query-based selection of frames relevant to the question, based on submodular mutual information (SMI) functions. By replacing uniform frame sampling with query-based selection, our method ensures that the chosen frames provide complementary and essential visual information for accurate VideoQA. We evaluate our approach on the MVBench dataset, which spans a diverse set of multi-action video tasks. VideoQA accuracy on this dataset was assessed using two VLMs, namely Video-LLaVA and LLaVA-NeXT, both of which originally employed uniform frame sampling. Experiments were conducted using both uniform and query-based sampling strategies. An accuracy improvement of up to 4% was observed when using query-based frame selection over uniform sampling. Qualitative analysis further highlights that query-based selection, using SMI functions, consistently picks frames better aligned with the question. We opine that such query-based frame selection can enhance accuracy in a wide range of tasks that rely on only a subset of video frames.
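A query-based frame selector of this flavour can be sketched as a greedy loop that trades query relevance against redundancy with already-chosen frames. The scoring rule below is a simple submodular-style surrogate for illustration, not the exact SMI functions used in the paper:

```python
import numpy as np

def select_frames(frame_feats, query_feat, k, lam=0.5):
    """Greedily pick k frames: score = cosine similarity to the query
    minus a penalty for redundancy with frames already selected."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    selected = []
    candidates = list(range(len(frame_feats)))
    for _ in range(k):
        def gain(i):
            redundancy = max((cos(frame_feats[i], frame_feats[j])
                              for j in selected), default=0.0)
            return cos(frame_feats[i], query_feat) - lam * redundancy
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)

# Toy 2-D "frame embeddings": frames 0-1 match the query, 2-3 do not.
frames = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
query = np.array([1.0, 0.2])
print(select_frames(frames, query, 2))
```

With the redundancy penalty the second pick still favours the query-aligned cluster here; raising `lam` would push the selector toward more diverse frames.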
https://arxiv.org/abs/2601.07459
Embedded vision systems need efficient and robust image processing algorithms to perform in real time on resource-constrained hardware. This research investigates image processing algorithms, specifically edge detection, corner detection, and blob detection, implemented on embedded processors, including DSPs and FPGAs. To address the latency, accuracy, and power-consumption issues noted in the image processing literature, optimized algorithm architectures and quantization techniques are employed. In addition, techniques for inter-frame redundancy removal and adaptive frame averaging are used to improve throughput while maintaining reasonable image quality. Simulations and hardware trials of the proposed approaches show marked improvements in processing speed and energy efficiency compared to conventional implementations. These advances pave the way for scalable and inexpensive embedded imaging systems in the automotive, surveillance, and robotics sectors, and underscore the benefit of co-designing algorithms and hardware architectures for practical real-time embedded vision applications.
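The two throughput techniques mentioned, inter-frame redundancy removal and adaptive frame averaging, can be sketched in a few lines. The difference threshold and the exponential-average form are illustrative assumptions:

```python
import numpy as np

def process_stream(frames, diff_thresh=4.0, alpha=0.3):
    """Skip frames nearly identical to the last processed one (inter-frame
    redundancy removal) and smooth the rest with a running exponential
    average (a simple form of adaptive frame averaging)."""
    processed, avg, last = [], None, None
    for f in frames:
        f = f.astype(np.float32)
        if last is not None and np.abs(f - last).mean() < diff_thresh:
            continue                          # redundant frame: skip it
        avg = f if avg is None else alpha * f + (1 - alpha) * avg
        processed.append(avg)
        last = f
    return processed

# Five synthetic frames; the 2nd and 4th barely differ from their
# predecessors, so only three frames reach the heavy processing stage.
frames = [np.full((4, 4), v) for v in (10, 11, 50, 51, 90)]
out = process_stream(frames)
print(len(out))
```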
https://arxiv.org/abs/2601.06243
Intelligent anomaly detection in dynamic visual environments requires reconciling real-time performance with semantic interpretability. Conventional approaches address only fragments of this challenge. Reconstruction-based models capture low-level deviations without contextual reasoning, object detectors provide speed but limited semantics, and large vision-language systems deliver interpretability at prohibitive computational cost. This work introduces a cascading multi-agent framework that unifies these complementary paradigms into a coherent and interpretable architecture. Early modules perform reconstruction-gated filtering and object-level assessment, while higher-level reasoning agents are selectively invoked to interpret semantically ambiguous events. The system employs adaptive escalation thresholds and a publish-subscribe communication backbone, enabling asynchronous coordination and scalable deployment across heterogeneous hardware. Extensive evaluation on large-scale monitoring data demonstrates that the proposed cascade achieves a threefold reduction in latency compared to direct vision-language inference, while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. The framework advances beyond conventional detection pipelines by combining early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, establishing a reproducible and energy-efficient foundation for scalable intelligent visual monitoring.
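The cascade's early-exit logic can be sketched as a reconstruction gate on PSNR followed by selective escalation. The 35 dB gate and the stub detector/reasoner below are hypothetical values for illustration, not the system's tuned thresholds:

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio between a frame and its reconstruction."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

def cascade(frame, recon, detect, reason, gate_db=35.0):
    """Stage 1: frames reconstructed well exit early as normal.
    Stage 2: the object-level check handles clear-cut cases.
    Stage 3: only ambiguous events reach the costly reasoning agent."""
    if psnr(frame, recon) >= gate_db:
        return "normal", "early-exit"
    verdict = detect(frame)
    if verdict in ("normal", "anomalous"):
        return verdict, "object-level"
    return reason(frame), "reasoning-agent"

frame = np.full((8, 8), 100.0)
good_recon = frame + 1.0    # high PSNR: the gate filters it out
bad_recon = frame + 40.0    # low PSNR: escalates down the cascade
detect = lambda f: "ambiguous"
reason = lambda f: "anomalous"
print(cascade(frame, good_recon, detect, reason))
print(cascade(frame, bad_recon, detect, reason))
```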
https://arxiv.org/abs/2601.06204
Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.
https://arxiv.org/abs/2601.04824
Robust long-term tracking of drones is a critical requirement for modern surveillance systems, given their increasing threat potential. While detector-based approaches typically achieve strong frame-level accuracy, they often suffer from temporal inconsistencies caused by frequent detection dropouts. Despite its practical relevance, research on RGB-based drone tracking is still limited and largely reliant on conventional motion models. Meanwhile, foundation models like SAMURAI have established their effectiveness across other domains, exhibiting strong category-agnostic tracking performance. However, their applicability in drone-specific scenarios has not been investigated yet. Motivated by this gap, we present the first systematic evaluation of SAMURAI's potential for robust drone tracking in urban surveillance settings. Furthermore, we introduce a detector-augmented extension of SAMURAI to mitigate sensitivity to bounding-box initialization and sequence length. Our findings demonstrate that the proposed extension significantly improves robustness in complex urban environments, with pronounced benefits in long-duration sequences, especially under drone exit and re-entry events. The incorporation of detector cues yields consistent gains over SAMURAI's zero-shot performance across datasets and metrics, with success-rate improvements of up to +0.393 and false negative rate (FNR) reductions of up to 0.475.
https://arxiv.org/abs/2601.04798
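A minimal sketch of how detector cues might gate a tracker's output per frame: keep the tracker's box while it is confident and consistent with the detector, and fall back to a confident detection when the tracker has drifted or lost the target (e.g. after an exit-re-entry event). The `fuse` logic and its thresholds are assumptions for illustration, not SAMURAI's actual extension.

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); intersection-over-union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def fuse(track_box, track_conf, det_box, det_conf,
         conf_thresh=0.3, iou_thresh=0.2):
    # Trust a confident tracker box unless it disagrees with a detection;
    # otherwise re-initialize from a confident detection if one exists.
    if track_box is not None and track_conf >= conf_thresh:
        if det_box is None or iou(track_box, det_box) >= iou_thresh:
            return track_box
    if det_box is not None and det_conf >= conf_thresh:
        return det_box
    return track_box
```

Running this gate once per frame lets the detector correct long-term drift while the tracker still smooths over frame-level detection dropouts.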
As artificial intelligence rapidly advances, society is increasingly captivated by promises of superhuman machines and seamless digital futures. Yet these visions often obscure mounting social, ethical, and psychological concerns tied to pervasive digital technologies - from surveillance to mental health crises. This article argues that a guiding ethos is urgently needed to navigate these transformations. Inspired by the lasting influence of the biblical Ten Commandments, a European interdisciplinary group has proposed "Ten Rules for the Digital World" - a novel ethical framework to help individuals and societies make prudent, human-centered decisions in the age of "supercharged" technology.
https://arxiv.org/abs/2601.03709
Our results reveal that a well-regularized shallow architecture can serve as a highly competitive baseline across heterogeneous domains - from smart-city surveillance to agricultural variety classification - without requiring large GPUs or specialized pre-trained models. This work establishes a unified, reproducible benchmark for multiple Bangladeshi vision datasets and highlights the practical value of lightweight CNNs for real-world deployment in low-resource settings.
https://arxiv.org/abs/2601.03463
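To make "lightweight" concrete, the back-of-envelope count below tallies the parameters of a hypothetical shallow baseline (three 3x3 conv blocks plus a small linear head; not the paper's actual architecture). It lands under 0.1M weights, versus roughly 25.6M for ResNet-50, which is why such models suit low-resource deployment.

```python
def conv_params(c_in, c_out, k=3):
    # Weights (k*k*c_in per filter) plus one bias per output channel.
    return k * k * c_in * c_out + c_out

# Hypothetical shallow baseline: three 3x3 conv blocks, then a head that
# maps 128 global-average-pooled features to 10 classes.
convs = conv_params(3, 32) + conv_params(32, 64) + conv_params(64, 128)
head = 128 * 10 + 10
total = convs + head
```

With `total` below 100k parameters, the model fits comfortably in CPU memory and trains without specialized hardware.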
As conversational AI systems become increasingly integrated into everyday life, they raise pressing concerns about user autonomy, trust, and the commercial interests that influence their behavior. To address these concerns, this paper develops the Fake Friend Dilemma (FFD), a sociotechnical condition in which users place trust in AI agents that appear supportive while pursuing goals that are misaligned with the user's own. The FFD provides a critical framework for examining how anthropomorphic AI systems facilitate subtle forms of manipulation and exploitation. Drawing on literature in trust, AI alignment, and surveillance capitalism, we construct a typology of harms, including covert advertising, political propaganda, behavioral nudging, and surveillance. We then assess possible mitigation strategies, including both structural and technical interventions. By focusing on trust as a vector of asymmetrical power, the FFD offers a lens for understanding how AI systems may undermine user autonomy while maintaining the appearance of helpfulness.
https://arxiv.org/abs/2601.03222