Human trajectory forecasting is crucial in applications such as autonomous driving, robotics, and surveillance. Accurate forecasting requires models to consider various factors, including social interactions, multi-modal predictions, pedestrian intention and environmental context. While existing methods account for these factors, they often overlook the impact of the environment, which leads to collisions with obstacles. This paper introduces ECAM (Environmental Collision Avoidance Module), a contrastive learning-based module that enhances a model's ability to avoid collisions with the environment. The proposed module can be integrated into existing trajectory forecasting models, improving their ability to generate collision-free predictions. We evaluate our method on the ETH/UCY dataset and quantitatively and qualitatively demonstrate its collision avoidance capabilities. Our experiments show that state-of-the-art methods significantly reduce the collision rate (by 40-50%) when integrated with the proposed module. The code is available at this https URL.
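The abstract does not specify the loss, but a contrastive collision-avoidance objective of the kind described could look roughly like the sketch below: predicted trajectory points are pushed to lie closer to sampled free-space points than to sampled obstacle points. The function, margin, and point-sampling scheme are illustrative assumptions, not the authors' implementation.

```python
import torch

def collision_contrastive_loss(pred_traj, free_pts, obstacle_pts, margin=0.5):
    """Hedged sketch of a contrastive-style collision-avoidance loss.

    pred_traj:     (T, 2) predicted future positions
    free_pts:      (F, 2) sampled free-space points (positives)
    obstacle_pts:  (O, 2) sampled obstacle points (negatives)
    """
    # Distance from every predicted point to its nearest free / obstacle point.
    d_free = torch.cdist(pred_traj, free_pts).min(dim=1).values      # (T,)
    d_obs = torch.cdist(pred_traj, obstacle_pts).min(dim=1).values   # (T,)
    # Triplet-style hinge: predictions should be closer to free space than to
    # obstacles by at least `margin` (assumed to be in metres for ETH/UCY).
    return torch.clamp(d_free - d_obs + margin, min=0.0).mean()

# Toy usage with random points.
loss = collision_contrastive_loss(torch.randn(12, 2), torch.randn(200, 2), torch.randn(50, 2))
```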
https://arxiv.org/abs/2506.09626
Abdominal aortic aneurysms (AAAs) are progressive focal dilatations of the abdominal aorta. AAAs may rupture, with a survival rate of only 20%. Current clinical guidelines recommend elective surgical repair when the maximum AAA diameter exceeds 55 mm in men or 50 mm in women. Patients who do not meet these criteria are periodically monitored, with surveillance intervals based on the maximum AAA diameter. However, this diameter does not take into account the complex relation between the 3D AAA shape and its growth, making standardized intervals potentially unsuitable. Personalized AAA growth predictions could improve monitoring strategies. We propose to use an SE(3)-symmetric transformer model to predict AAA growth directly on the vascular model surface enriched with local, multi-physical features. In contrast to other works that have parameterized the AAA shape, this representation preserves the vascular surface's anatomical structure and geometric fidelity. We train our model using a longitudinal dataset of 113 computed tomography angiography (CTA) scans of 24 AAA patients at irregularly sampled intervals. After training, our model predicts AAA growth to the next scan moment with a median diameter error of 1.18 mm. We further demonstrate our model's utility in identifying whether a patient will become eligible for elective repair within two years (acc = 0.93). Finally, we evaluate our model's generalization on an external validation set consisting of 25 CTAs from 7 AAA patients from a different hospital. Our results show that local directional AAA growth prediction from the vascular surface is feasible and may contribute to personalized surveillance strategies.
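As a small illustration of the elective-repair eligibility task mentioned above, the following hypothetical helper applies the guideline thresholds (55 mm for men, 50 mm for women) to predicted maximum diameters within a two-year horizon; it is not part of the paper's code.

```python
def eligible_for_repair(predicted_diameters_mm, sex, horizon_years=2.0, scan_times_years=None):
    """Return True if any predicted maximum AAA diameter within the horizon
    exceeds the guideline threshold (55 mm for men, 50 mm for women).

    predicted_diameters_mm: predicted maximum diameters at future scan moments
    scan_times_years:       times of those scan moments relative to now
    """
    threshold = 55.0 if sex == "male" else 50.0
    if scan_times_years is None:
        scan_times_years = [horizon_years] * len(predicted_diameters_mm)
    return any(d > threshold
               for d, t in zip(predicted_diameters_mm, scan_times_years)
               if t <= horizon_years)

# Example: a predicted diameter of 56.2 mm at 1.5 years makes a male patient eligible.
print(eligible_for_repair([52.0, 56.2], sex="male", scan_times_years=[0.8, 1.5]))  # True
```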
https://arxiv.org/abs/2506.08729
Real-world surveillance often renders faces and license plates unrecognizable in individual low-resolution (LR) frames, hindering reliable identification. To advance temporal recognition models, we present FANVID, a novel video-based benchmark comprising 1,463 LR clips (180 x 320, 20--60 FPS) featuring 63 identities and 49 license plates from three English-speaking countries. Each video includes distractor faces and plates, increasing task difficulty and realism. The dataset contains 31,096 manually verified bounding boxes and labels. FANVID defines two tasks: (1) face matching -- detecting LR faces and matching them to high-resolution mugshots, and (2) license plate recognition -- extracting text from LR plates without a predefined database. Videos are downsampled from high-resolution sources to ensure that faces and text are indecipherable in single frames, requiring models to exploit temporal information. We introduce evaluation metrics adapted from mean Average Precision at IoU > 0.5, prioritizing identity correctness for faces and character-level accuracy for text. A baseline method with pre-trained video super-resolution, detection, and recognition achieved performance scores of 0.58 (face matching) and 0.42 (plate recognition), highlighting both the feasibility and challenge of the tasks. FANVID's selection of faces and plates balances diversity with recognition challenge. We release the software for data access, evaluation, baseline, and annotation to support reproducibility and extension. FANVID aims to catalyze innovation in temporal modeling for LR recognition, with applications in surveillance, forensics, and autonomous vehicles.
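The abstract names two metric ingredients, an IoU > 0.5 detection match and character-level accuracy for plate text; a minimal sketch of how they could be combined for a single prediction is shown below, with hypothetical helper names and a generic character-similarity measure.

```python
from difflib import SequenceMatcher

def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def char_accuracy(pred_text, gt_text):
    """Character-level similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, pred_text.upper(), gt_text.upper()).ratio()

def plate_detection_score(pred, gt, iou_thr=0.5):
    """Score one prediction against one ground-truth plate: the detection must
    overlap with IoU > 0.5, and the text is then credited at character level."""
    if iou(pred["box"], gt["box"]) <= iou_thr:
        return 0.0
    return char_accuracy(pred["text"], gt["text"])

print(plate_detection_score({"box": (10, 10, 110, 40), "text": "AB12 CDE"},
                            {"box": (12, 11, 108, 42), "text": "AB12 CDE"}))
```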
https://arxiv.org/abs/2506.07304
Surveillance systems play a critical role in security and reconnaissance, but their performance is often compromised by low-quality images and videos, leading to reduced accuracy in face recognition. Additionally, existing AI-based facial analysis models suffer from biases related to skin tone variations and partially occluded faces, further limiting their effectiveness in diverse real-world scenarios. These challenges are the result of data limitations and imbalances, where available training datasets lack sufficient diversity, resulting in unfair and unreliable facial recognition performance. To address these issues, we propose a data-driven platform that enhances surveillance capabilities by generating synthetic training data tailored to compensate for dataset biases. Our approach leverages deep learning-based facial attribute manipulation and reconstruction using autoencoders and Generative Adversarial Networks (GANs) to create diverse and high-quality facial datasets. Additionally, our system integrates an image enhancement module, improving the clarity of low-resolution or occluded faces in surveillance footage. We evaluate our approach using the CelebA dataset, demonstrating that the proposed platform enhances both training data diversity and model fairness. This work contributes to reducing bias in AI-based facial analysis and improving surveillance accuracy in challenging environments, leading to fairer and more reliable security applications.
https://arxiv.org/abs/2506.06578
Audiovisual segmentation (AVS) aims to identify visual regions corresponding to sound sources, playing a vital role in video understanding, surveillance, and human-computer interaction. Traditional AVS methods depend on large-scale pixel-level annotations, which are costly and time-consuming to obtain. To address this, we propose a novel zero-shot AVS framework that eliminates task-specific training by leveraging multiple pretrained models. Our approach integrates audio, vision, and text representations to bridge modality gaps, enabling precise sound source segmentation without AVS-specific annotations. We systematically explore different strategies for connecting pretrained models and evaluate their efficacy across multiple datasets. Experimental results demonstrate that our framework achieves state-of-the-art zero-shot AVS performance, highlighting the effectiveness of multimodal model integration for fine-grained audiovisual segmentation.
https://arxiv.org/abs/2506.06537
Video anomaly detection (VAD) is crucial in scenarios such as surveillance and autonomous driving, where timely detection of unexpected activities is essential. Although existing methods have primarily focused on detecting anomalous objects in videos -- either by identifying anomalous frames or objects -- they often neglect finer-grained analysis, such as anomalous pixels, which limits their ability to capture a broader range of anomalies. To address this challenge, we propose a new framework called Track Any Anomalous Object (TAO), which introduces a granular video anomaly detection pipeline that, for the first time, integrates the detection of multiple fine-grained anomalous objects into a unified framework. Unlike methods that assign anomaly scores to every pixel, our approach transforms the problem into pixel-level tracking of anomalous objects. By linking anomaly scores to downstream tasks such as segmentation and tracking, our method removes the need for threshold tuning and achieves more precise anomaly localization in long and complex video sequences. Experiments demonstrate that TAO sets new benchmarks in accuracy and robustness. Project page available online.
https://arxiv.org/abs/2506.05175
Person Re-Identification (Re-ID) is an important task in video surveillance systems, supporting applications such as tracking people, finding people in public places, and analysing customer behavior in supermarkets. Although many works have addressed this problem, challenges remain, including large-scale datasets, imbalanced data, viewpoint variation, and fine-grained data (attributes); moreover, local features are not employed at the semantic level in the online stage of the Re-ID task, and the imbalanced-data problem of attributes is not taken into consideration. This paper proposes a unified Re-ID system consisting of three main modules: a Pedestrian Attribute Ontology (PAO), a Local Multi-task DCNN (Local MDCNN), and an Imbalance Data Solver (IDS). The main novelty of our Re-ID system is the mutual support of the PAO, Local MDCNN, and IDS, which exploit the inner-group correlations of attributes and pre-filter mismatched candidates from the gallery set based on semantic information such as fashion attributes and facial attributes, solving the imbalanced-data problem of attributes without adjusting the network architecture or applying data augmentation. We experimented on the well-known Market1501 dataset. The experimental results show the effectiveness of our Re-ID system, which achieves higher performance on the Market1501 dataset than some state-of-the-art Re-ID methods.
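A minimal sketch of the gallery pre-filtering idea described above (discarding candidates whose semantic attributes disagree with the query before ranking by appearance embedding) might look as follows; the attribute keys are hypothetical examples, not the paper's ontology.

```python
import numpy as np

def prefilter_gallery(query_attrs, gallery, required_keys=("gender", "hair_length", "upper_color")):
    """Keep only gallery entries whose predicted attributes agree with the query
    on a small set of semantic attributes (hypothetical keys shown)."""
    return [g for g in gallery
            if all(g["attrs"].get(k) == query_attrs.get(k) for k in required_keys)]

def rank_candidates(query_emb, candidates):
    """Rank the surviving candidates by cosine similarity of appearance embeddings."""
    sims = [float(np.dot(query_emb, c["emb"]) /
                  (np.linalg.norm(query_emb) * np.linalg.norm(c["emb"]) + 1e-9))
            for c in candidates]
    return [c for _, c in sorted(zip(sims, candidates), key=lambda p: -p[0])]

# Usage: pre-filter first, then rank only the remaining candidates.
# ranked = rank_candidates(query_emb, prefilter_gallery(query_attrs, gallery))
```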
https://arxiv.org/abs/2506.04143
The quantification of social science remains a longstanding challenge, largely due to the philosophical nature of its foundational theories. Although quantum computing has advanced rapidly in recent years, its relevance to social theory remains underexplored. Most existing research focuses on micro-cognitive models or philosophical analogies, leaving a gap in system-level applications of quantum principles to the analysis of social systems. This study addresses that gap by proposing a theoretical and computational framework that combines quantum mechanics with Generative AI to simulate the emergence and evolution of social norms. Drawing on core quantum concepts--such as superposition, entanglement, and probabilistic measurement--this research models society as a dynamic, uncertain system and sets up five ideal-type experiments. These scenarios are simulated using 25 generative agents, each assigned evolving roles as compliers, resistors, or enforcers. Within a simulated environment monitored by a central observer (the Watcher), agents interact, respond to surveillance, and adapt to periodic normative disruptions. These interactions allow the system to self-organize under external stress and reveal emergent patterns. Key findings show that quantum principles, when integrated with generative AI, enable the modeling of uncertainty, emergence, and interdependence in complex social systems. Simulations reveal patterns including convergence toward normative order, the spread of resistance, and the spontaneous emergence of new equilibria in social rules. In conclusion, this study introduces a novel computational lens that lays the groundwork for a quantum-informed social theory. It offers interdisciplinary insights into how society can be understood not just as a structure to observe but as a dynamic system to simulate and redesign through quantum technologies.
https://arxiv.org/abs/2506.03503
Visual Object Tracking (VOT) is a fundamental task with widespread applications in autonomous navigation, surveillance, and maritime robotics. Despite significant advances in generic object tracking, maritime environments continue to present unique challenges, including specular water reflections, low-contrast targets, dynamically changing backgrounds, and frequent occlusions. These complexities significantly degrade the performance of state-of-the-art tracking algorithms, highlighting the need for domain-specific datasets. To address this gap, we introduce the Maritime Visual Tracking Dataset (MVTD), a comprehensive and publicly available benchmark specifically designed for maritime VOT. MVTD comprises 182 high-resolution video sequences, totaling approximately 150,000 frames, and includes four representative object classes: boat, ship, sailboat, and unmanned surface vehicle (USV). The dataset captures a diverse range of operational conditions and maritime scenarios, reflecting the real-world complexities of maritime environments. We evaluated 14 recent SOTA tracking algorithms on the MVTD benchmark and observed substantial performance degradation compared to their performance on general-purpose datasets. However, when fine-tuned on MVTD, these models demonstrate significant performance gains, underscoring the effectiveness of domain adaptation and the importance of transfer learning in specialized tracking contexts. The MVTD dataset fills a critical gap in the visual tracking community by providing a realistic and challenging benchmark for maritime scenarios. The dataset and source code can be accessed at this https URL.
https://arxiv.org/abs/2506.02866
Benchmark object detection (OD) datasets play a pivotal role in advancing computer vision applications such as autonomous driving and surveillance, as well as in training and evaluating deep learning-based state-of-the-art detection models. Among them, MS-COCO has become a standard benchmark due to its diverse object categories and complex scenes. However, despite its wide adoption, MS-COCO suffers from various annotation issues, including missing labels, incorrect class assignments, inaccurate bounding boxes, duplicate labels, and group labeling inconsistencies. These errors not only hinder model training but also degrade the reliability and generalization of OD models. To address these challenges, we propose a comprehensive refinement framework and present MJ-COCO, a newly re-annotated version of MS-COCO. Our approach begins with loss and gradient-based error detection to identify potentially mislabeled or hard-to-learn samples. Next, we apply a four-stage pseudo-labeling refinement process: (1) bounding box generation using invertible transformations, (2) IoU-based duplicate removal and confidence merging, (3) class consistency verification via an expert object recognizer, and (4) spatial adjustment based on object region activation map analysis. This integrated pipeline enables scalable and accurate correction of annotation errors without manual re-labeling. Extensive experiments were conducted across four validation datasets: MS-COCO, Sama COCO, Objects365, and PASCAL VOC. Models trained on MJ-COCO consistently outperformed those trained on MS-COCO, achieving improvements in Average Precision (AP) and APS metrics. MJ-COCO also demonstrated significant gains in annotation coverage: for example, the number of small object annotations increased by more than 200,000 compared to MS-COCO.
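Stage (2), IoU-based duplicate removal and confidence merging, could be sketched as below; the threshold and the exact merging rule are assumptions rather than the paper's settings.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def merge_duplicates(boxes, iou_thr=0.7):
    """boxes: list of dicts {"box": (x1, y1, x2, y2), "cls": int, "score": float}.
    Greedily merge same-class boxes whose IoU exceeds the threshold, keeping a
    confidence-weighted average of the coordinates and the maximum score."""
    boxes = sorted(boxes, key=lambda b: -b["score"])
    merged = []
    for b in boxes:
        for m in merged:
            if m["cls"] == b["cls"] and iou(m["box"], b["box"]) > iou_thr:
                w1, w2 = m["score"], b["score"]
                m["box"] = tuple((w1 * x + w2 * y) / (w1 + w2)
                                 for x, y in zip(m["box"], b["box"]))
                m["score"] = max(w1, w2)
                break
        else:
            merged.append(dict(b))
    return merged
```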
https://arxiv.org/abs/2506.00997
The construction industry faces significant challenges in optimizing equipment utilization, as underused machinery leads to increased operational costs and project delays. Accurate and timely monitoring of equipment activity is therefore key to identifying idle periods and improving overall efficiency. This paper presents the Edge-IMI framework for detecting idle construction machinery, specifically designed for integration with surveillance camera systems. The proposed solution consists of three components: object detection, tracking, and idle state identification, which are tailored for execution on resource-constrained, CPU-based edge computing devices. The performance of Edge-IMI is evaluated using a combined dataset derived from the ACID and MOCS benchmarks. Experimental results confirm that the object detector achieves an F1 score of 71.75%, indicating robust real-world detection capabilities. The logistic regression-based idle identification module reliably distinguishes between active and idle machinery with minimal false positives. Integrating all three modules, Edge-IMI enables efficient on-site inference, reducing reliance on high-bandwidth cloud services and costly hardware accelerators. We also evaluate the performance of object detection models on a Raspberry Pi 5 and an Intel NUC, as example edge computing platforms. We assess the feasibility of real-time processing and the impact of model optimization techniques.
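The idle-identification module is described as logistic regression over tracked machinery; a minimal sketch under assumed track features (mean centroid displacement and relative box-size change over a window) is:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def track_features(track):
    """track: array of shape (T, 4) with (cx, cy, w, h) per frame.
    Assumed features: mean centroid displacement and mean relative box-size change."""
    centers = track[:, :2]
    disp = np.linalg.norm(np.diff(centers, axis=0), axis=1).mean()
    size = track[:, 2] * track[:, 3]
    size_change = np.abs(np.diff(size)).mean() / (size.mean() + 1e-9)
    return [disp, size_change]

# Hypothetical training data: features for active (1) and idle (0) tracks.
X = [[8.5, 0.12], [6.1, 0.09], [0.3, 0.01], [0.5, 0.02]]
y = [1, 1, 0, 0]
clf = LogisticRegression().fit(X, y)

new_track = np.array([[100, 50, 40, 30], [100.4, 50.1, 40, 30], [100.2, 50.0, 40, 30]], dtype=float)
print("active" if clf.predict([track_features(new_track)])[0] else "idle")
```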
https://arxiv.org/abs/2506.00904
The desire for cameras with smaller form factors has recently led to a push for exploring computational imaging systems with reduced optical complexity, such as a smaller number of lens elements. Unfortunately, such simplified optical systems usually suffer from severe aberrations, especially in off-axis regions, which can be difficult to correct purely in software. In this paper we introduce Fovea Stacking, a new type of imaging system that utilizes emerging dynamic optical components called deformable phase plates (DPPs) for localized aberration correction anywhere on the image sensor. By optimizing DPP deformations through a differentiable optical model, off-axis aberrations are corrected locally, producing a foveated image with enhanced sharpness at the fixation point, analogous to the eye's fovea. Stacking multiple such foveated images, each with a different fixation point, yields a composite image free from aberrations. To efficiently cover the entire field of view, we propose joint optimization of DPP deformations under imaging budget constraints. Due to the DPP device's non-linear behavior, we introduce a neural network-based control model for improved alignment between simulated and hardware performance. We further demonstrate that for extended depth-of-field imaging, fovea stacking outperforms traditional focus stacking in image quality. By integrating object detection or eye-tracking, the system can dynamically adjust the lens to track the object of interest, enabling real-time foveated video suitable for downstream applications such as surveillance or foveated virtual reality displays.
https://arxiv.org/abs/2506.00716
Robust scene segmentation and keyframe extraction are essential preprocessing steps in video understanding pipelines, supporting tasks such as indexing, summarization, and semantic retrieval. However, existing methods often lack generalizability across diverse video types and durations. We present a unified, adaptive framework for automatic scene detection and keyframe selection that handles formats ranging from short-form media to long-form films, archival content, and surveillance footage. Our system dynamically selects segmentation policies based on video length: adaptive thresholding for short videos, hybrid strategies for mid-length ones, and interval-based splitting for extended recordings. This ensures consistent granularity and efficient processing across domains. For keyframe selection, we employ a lightweight module that scores sampled frames using a composite metric of sharpness, luminance, and temporal spread, avoiding complex saliency models while ensuring visual relevance. Designed for high-throughput workflows, the system is deployed in a commercial video analysis platform and has processed content from media, education, research, and security domains. It offers a scalable and interpretable solution suitable for downstream applications such as UI previews, embedding pipelines, and content filtering. We discuss practical implementation details and outline future enhancements, including audio-aware segmentation and reinforcement-learned frame scoring.
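A rough sketch of the composite keyframe score (sharpness and luminance per frame, with temporal spread enforced by a minimum gap between selections) is given below; the weights, the Laplacian-style sharpness proxy, and the greedy selection are assumptions, not the deployed implementation.

```python
import numpy as np

def frame_score(gray, w_sharp=0.7, w_lum=0.3):
    """gray: 2D float array in [0, 1]. Sharpness = variance of a simple
    Laplacian approximation; luminance = mean intensity."""
    lap = (np.roll(gray, 1, 0) + np.roll(gray, -1, 0) +
           np.roll(gray, 1, 1) + np.roll(gray, -1, 1) - 4 * gray)
    return w_sharp * lap.var() + w_lum * gray.mean()

def select_keyframes(frames, k=3, min_gap=10):
    """Greedy: take the highest-scoring frames while keeping at least
    `min_gap` frames between any two selections (temporal spread)."""
    order = sorted(range(len(frames)), key=lambda i: -frame_score(frames[i]))
    chosen = []
    for i in order:
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return sorted(chosen)

video = [np.random.rand(64, 64) for _ in range(120)]  # stand-in for sampled grayscale frames
print(select_keyframes(video))
```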
https://arxiv.org/abs/2506.00667
Unmanned Aerial Vehicles (UAVs) are one of the most revolutionary inventions of the 21st century. At the core of a UAV lies the central processing system that uses wireless signals to control its movement. The most popular UAVs are quadcopters, which use a set of four motors, arranged as two on either side with opposite spin. An autonomous UAV is called a drone. Drones have been in service in the US Army since the 1990s for covert missions critical to national security. It would not be wrong to claim that drones make up an integral part of national security and provide the most valuable service during surveillance operations. While UAVs are controlled using wireless signals, several challenges can disrupt their operation, such as signal quality and range, real-time processing, human expertise, robust hardware, and data security. These challenges can be addressed by programming UAVs to be autonomous, using object detection and tracking through computer vision algorithms. Computer vision is an interdisciplinary field that uses deep learning to gain a high-level understanding of digital images and videos in order to automate tasks of the human visual system. Using computer vision, algorithms for detecting and tracking various objects can be developed to suit the hardware, allowing real-time processing for immediate judgement. This paper reviews the approaches several authors have proposed for autonomous navigation of UAVs through various algorithms for real-time object detection and tracking, with applications in fields such as disaster management, dense-area exploration, and traffic vehicle surveillance.
https://arxiv.org/abs/2506.05378
In the surveillance and defense domain, multi-target detection and classification (MTD) is considered essential yet challenging due to heterogeneous inputs from diverse data sources and the computational complexity of algorithms designed for resource-constrained embedded devices, particularly for AI-based solutions. To address these challenges, we propose a feature fusion and knowledge-distilled framework for multi-modal MTD that leverages data fusion to enhance accuracy and employs knowledge distillation for improved domain adaptation. Specifically, our approach utilizes both RGB and thermal image inputs within a novel fusion-based multi-modal model, coupled with a distillation training pipeline. We formulate the problem as a posterior probability optimization task, which is solved through a multi-stage training pipeline supported by a composite loss function. This loss function effectively transfers knowledge from a teacher model to a student model. Experimental results demonstrate that our student model achieves approximately 95% of the teacher model's mean Average Precision while reducing inference time by approximately 50%, underscoring its suitability for practical MTD deployment scenarios.
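The composite distillation loss is not spelled out in the abstract; one standard form it could take, hard-label cross-entropy plus a temperature-scaled KL term between teacher and student logits, is sketched below as an assumption.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Composite loss: (1 - alpha) * cross-entropy on ground-truth labels
    + alpha * T^2 * KL(teacher || student) on temperature-softened logits."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return (1 - alpha) * ce + alpha * kl

# Toy usage with random logits for an 8-sample batch and 5 target classes.
loss = distillation_loss(torch.randn(8, 5), torch.randn(8, 5), torch.randint(0, 5, (8,)))
```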
https://arxiv.org/abs/2506.00365
Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries, limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available at this https URL
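The step-level evaluation and the full-chain success figure quoted above can be illustrated with a small hypothetical helper: a task counts as a full-chain success only if every step is judged correct and any tool call is effective. The field names are assumptions, not the benchmark's schema.

```python
def full_chain_success(task_steps):
    """task_steps: list of dicts with boolean 'correct' and optional 'tool_ok'."""
    return all(s["correct"] and s.get("tool_ok", True) for s in task_steps)

def benchmark_summary(tasks):
    """Return step-level accuracy and full-chain success rate over all tasks."""
    steps = [s for t in tasks for s in t]
    step_acc = sum(s["correct"] for s in steps) / len(steps)
    chain_rate = sum(full_chain_success(t) for t in tasks) / len(tasks)
    return step_acc, chain_rate

tasks = [
    [{"correct": True, "tool_ok": True}, {"correct": True}],   # full-chain success
    [{"correct": True}, {"correct": False, "tool_ok": True}],  # fails at step 2
]
print(benchmark_summary(tasks))  # (0.75, 0.5)
```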
https://arxiv.org/abs/2505.24876
The COVID-19 pandemic, caused by SARS-CoV-2, highlighted the critical need for accurate prediction of disease severity to optimize healthcare resource allocation and patient management. The spike protein, which facilitates viral entry into host cells, exhibits high mutation rates, particularly in the receptor-binding domain, influencing viral pathogenicity. Artificial intelligence approaches, such as deep learning, offer promising solutions for leveraging genomic and clinical data to predict disease outcomes. Objective: This study aimed to develop a hybrid CNN-LSTM deep learning model to predict COVID-19 severity using spike protein sequences and associated clinical metadata from South American patients. Methods: We retrieved 9,570 spike protein sequences from the GISAID database, of which 3,467 met inclusion criteria after standardization. The dataset included 2,313 severe and 1,154 mild cases. A feature engineering pipeline extracted features from sequences, while demographic and clinical variables were one-hot encoded. A hybrid CNN-LSTM architecture was trained, combining CNN layers for local pattern extraction and an LSTM layer for long-term dependency modeling. Results: The model achieved an F1 score of 82.92%, ROC-AUC of 0.9084, precision of 83.56%, and recall of 82.85%, demonstrating robust classification performance. Training stabilized at 85% accuracy with minimal overfitting. The most prevalent lineages (P.1, AY.99.2) and clades (GR, GK) aligned with regional epidemiological trends, suggesting potential associations between viral genetics and clinical outcomes. Conclusion: The CNN-LSTM hybrid model effectively predicted COVID-19 severity using spike protein sequences and clinical data, highlighting the utility of AI in genomic surveillance and precision public health. Despite limitations, this approach provides a framework for early severity prediction in future outbreaks.
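A hybrid CNN-LSTM of the kind described (1D convolutions over an encoded spike-protein sequence, an LSTM for long-range dependencies, and a head that also takes one-hot clinical metadata) might be sketched as follows; the layer sizes and input encoding are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CNNLSTMSeverity(nn.Module):
    """Sketch: 1D convolutions over an encoded spike-protein sequence,
    an LSTM over the resulting feature maps, then a binary severity head."""
    def __init__(self, n_tokens=21, embed_dim=32, conv_ch=64, lstm_hidden=64, n_clinical=8):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, embed_dim)             # amino-acid tokens
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, conv_ch, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(conv_ch, conv_ch, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(conv_ch, lstm_hidden, batch_first=True)
        self.head = nn.Linear(lstm_hidden + n_clinical, 1)         # + one-hot clinical metadata

    def forward(self, seq_tokens, clinical):
        x = self.embed(seq_tokens).transpose(1, 2)                 # (B, C, L)
        x = self.conv(x).transpose(1, 2)                           # (B, L', C)
        _, (h, _) = self.lstm(x)
        return torch.sigmoid(self.head(torch.cat([h[-1], clinical], dim=1)))

model = CNNLSTMSeverity()
prob_severe = model(torch.randint(0, 21, (4, 1273)), torch.randn(4, 8))  # (4, 1) severity probabilities
```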
https://arxiv.org/abs/2505.23879
Video Anomaly Understanding (VAU) is essential for applications such as smart cities, security surveillance, and disaster alert systems, yet remains challenging due to its demand for fine-grained spatio-temporal perception and robust reasoning under ambiguity. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. This limitation is further compounded by the absence of comprehensive benchmarks for evaluating reasoning ability in anomaly scenarios. To address both challenges, we introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT). Besides, we propose VAU-Bench, the first Chain-of-Thought benchmark tailored for video anomaly reasoning, featuring multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Empirical results show that VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence across diverse contexts. Together, our method and benchmark establish a strong foundation for interpretable and reasoning-aware video anomaly understanding. Our code is available at this https URL.
https://arxiv.org/abs/2505.23504
Multimodal large language models (MLLMs) demonstrate remarkable capabilities in handling complex multimodal tasks and are increasingly adopted in video understanding applications. However, their rapid advancement raises serious data privacy concerns, particularly given the potential inclusion of sensitive video content, such as personal recordings and surveillance footage, in their training datasets. Determining improperly used videos during training remains a critical and unresolved challenge. Despite considerable progress on membership inference attacks (MIAs) for text and image data in MLLMs, existing methods fail to generalize effectively to the video domain. These methods suffer from poor scalability as more frames are sampled and generally achieve negligible true positive rates at low false positive rates (TPR@Low FPR), mainly due to their failure to capture the inherent temporal variations of video frames and to account for model behavior differences as the number of frames varies. To address these challenges, we introduce Vid-SME, the first membership inference method tailored for video data used in video understanding LLMs (VULLMs). Vid-SME leverages the confidence of model output and integrates adaptive parameterization to compute Sharma-Mittal entropy (SME) for video inputs. By leveraging the SME difference between natural and temporally-reversed video frames, Vid-SME derives robust membership scores to determine whether a given video is part of the model's training set. Experiments on various self-trained and open-sourced VULLMs demonstrate the strong effectiveness of Vid-SME.
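The membership score builds on the Sharma-Mittal entropy of the model's output distributions and the gap between natural-order and temporally reversed frames; a numeric sketch with a fixed (q, r) parameterization, rather than the paper's adaptive scheme, is shown below. The sign convention of the decision rule is an assumption.

```python
import numpy as np

def sharma_mittal_entropy(p, q=1.5, r=0.8, eps=1e-12):
    """Two-parameter Sharma-Mittal entropy of a probability vector p (q, r != 1):
    H = ((sum_i p_i^q)^((1 - r) / (1 - q)) - 1) / (1 - r)."""
    p = np.asarray(p, dtype=float) + eps
    p = p / p.sum()
    s = np.sum(p ** q)
    return (s ** ((1 - r) / (1 - q)) - 1.0) / (1 - r)

def membership_score(probs_natural, probs_reversed, q=1.5, r=0.8):
    """Average SME gap between natural-order and reversed-order frame predictions;
    a larger gap is taken here as evidence the video was seen during training."""
    h_nat = np.mean([sharma_mittal_entropy(p, q, r) for p in probs_natural])
    h_rev = np.mean([sharma_mittal_entropy(p, q, r) for p in probs_reversed])
    return h_rev - h_nat

nat = [np.random.dirichlet(np.ones(50)) for _ in range(16)]  # stand-in output distributions
rev = [np.random.dirichlet(np.ones(50)) for _ in range(16)]
print(membership_score(nat, rev))
```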
https://arxiv.org/abs/2506.03179
Infrared small target detection (ISTD) is vital for long-range surveillance in military, maritime, and early warning applications. ISTD is challenged by targets occupying less than 0.15% of the image and low distinguishability from complex backgrounds. Existing deep learning methods often suffer from information loss during downsampling and inefficient global context modeling. This paper presents SAMamba, a novel framework integrating SAM2's hierarchical feature learning with Mamba's selective sequence modeling. Key innovations include: (1) A Feature Selection Adapter (FS-Adapter) for efficient natural-to-infrared domain adaptation via dual-stage selection (token-level with a learnable task embedding and channel-wise adaptive transformations); (2) A Cross-Channel State-Space Interaction (CSI) module for efficient global context modeling with linear complexity using selective state space modeling; and (3) A Detail-Preserving Contextual Fusion (DPCF) module that adaptively combines multi-scale features with a gating mechanism to balance high-resolution and low-resolution feature contributions. SAMamba addresses core ISTD challenges by bridging the domain gap, maintaining fine-grained details, and efficiently modeling long-range dependencies. Experiments on NUAA-SIRST, IRSTD-1k, and NUDT-SIRST datasets show SAMamba significantly outperforms state-of-the-art methods, especially in challenging scenarios with heterogeneous backgrounds and varying target scales. Code: this https URL.
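The Detail-Preserving Contextual Fusion (DPCF) module is described as adaptively combining multi-scale features with a gating mechanism; a generic gated fusion of a high-resolution and an upsampled low-resolution feature map, written as an assumption rather than the paper's exact design, is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Sketch of gated multi-scale fusion: a 1x1 conv predicts a per-pixel gate
    that balances high-resolution detail against upsampled low-resolution context."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_high, feat_low):
        # Upsample the coarse feature map to the fine resolution before gating.
        feat_low = F.interpolate(feat_low, size=feat_high.shape[-2:],
                                 mode="bilinear", align_corners=False)
        g = torch.sigmoid(self.gate(torch.cat([feat_high, feat_low], dim=1)))
        return g * feat_high + (1 - g) * feat_low

fuse = GatedFusion(channels=32)
out = fuse(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 16, 16))  # -> (1, 32, 64, 64)
```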
https://arxiv.org/abs/2505.23214