Infrared and visible image fusion (IVIF) aims to preserve thermal radiation information from infrared images while integrating texture details from visible images, enabling the capture of important features and hidden details of subjects in complex scenes and disturbed environments. IVIF therefore offers distinct advantages in practical applications such as video surveillance, night navigation, and target recognition. However, prevailing methods often struggle to capture thermal region features and detailed information simultaneously, owing to the disparate characteristics of infrared and visible images; as a result, fusion outcomes frequently entail a compromise between thermal target information and texture details. In this study, we introduce a novel heterogeneous dual-discriminator generative adversarial network (HDDGAN) to address this issue. Specifically, the generator adopts a multi-scale skip-connected architecture that facilitates the extraction of essential features from the different source images. To enhance the information representation ability of the fusion result, an attention mechanism is employed to construct the information fusion layer within the generator, leveraging the disparities between the source images. Moreover, recognizing that infrared and visible images impose distinct learning requirements, we design two discriminators with differing structures, guiding the model to learn salient information from infrared images while simultaneously capturing detailed information from visible images. Extensive experiments on various public datasets demonstrate the superiority of our proposed HDDGAN over other state-of-the-art (SOTA) algorithms, highlighting its enhanced potential for practical applications.
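The abstract describes the attention-driven fusion layer only at a high level. As a rough illustration of how a difference-based attention weight could arbitrate between an infrared and a visible feature map, here is a minimal numpy sketch; the sigmoid weighting and the toy feature values are hypothetical, not HDDGAN's actual fusion layer:

```python
import numpy as np

def attention_fusion(ir_feat: np.ndarray, vis_feat: np.ndarray) -> np.ndarray:
    """Fuse infrared and visible feature maps with a soft attention weight
    derived from their per-pixel magnitude difference (hypothetical sketch,
    not the paper's exact fusion layer)."""
    # Larger |difference| -> the modality with the stronger response dominates.
    diff = np.abs(ir_feat) - np.abs(vis_feat)
    w_ir = 1.0 / (1.0 + np.exp(-diff))      # sigmoid weight for the IR branch
    return w_ir * ir_feat + (1.0 - w_ir) * vis_feat

ir = np.array([[5.0, 0.1], [4.0, 0.0]])     # strong thermal responses
vis = np.array([[0.2, 3.0], [0.1, 2.5]])    # strong texture responses
fused = attention_fusion(ir, vis)
```

Where the infrared response dominates, the fused value stays close to the infrared feature, and vice versa for texture-dominated pixels.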
https://arxiv.org/abs/2404.15992
Spatiotemporal networks' observational capabilities are crucial for accurate data gathering and informed decision-making across multiple sectors. This study focuses on the Spatiotemporal Ranged Observer-Observable Bipartite Network (STROOBnet), which links observational nodes (e.g., surveillance cameras) to events within defined geographical regions, enabling efficient monitoring. Using data from Real-Time Crime Camera (RTCC) systems and Calls for Service (CFS) in New Orleans, where the RTCC combats rising crime amidst reduced police presence, we address the network's initial observational imbalances. Aiming for uniform observational efficacy, we propose the Proximal Recurrence approach, which outperformed traditional clustering methods such as k-means and DBSCAN by jointly considering event frequency and spatial distribution, thereby enhancing observational coverage.
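The abstract does not spell out the Proximal Recurrence algorithm. As a stand-in, the sketch below greedily scores candidate observer sites by how many still-uncovered events fall within an assumed observation radius — a simplification that captures only the "event frequency plus spatial consideration" idea, not the actual method:

```python
import numpy as np

def greedy_placement(events: np.ndarray, candidates: np.ndarray,
                     radius: float, k: int) -> list:
    """Greedily pick k observer sites that each cover the most
    still-uncovered events within `radius` (illustrative stand-in
    for the Proximal Recurrence approach)."""
    uncovered = np.ones(len(events), dtype=bool)
    chosen = []
    for _ in range(k):
        best, best_gain = None, -1
        for i, c in enumerate(candidates):
            if i in chosen:
                continue
            d = np.linalg.norm(events - c, axis=1)
            gain = int(np.sum(uncovered & (d <= radius)))
            if gain > best_gain:
                best, best_gain = i, gain
        chosen.append(best)
        d = np.linalg.norm(events - candidates[best], axis=1)
        uncovered &= d > radius          # mark newly covered events
    return chosen

events = np.array([[0, 0], [0.1, 0], [5, 5], [5.2, 5.1], [9, 9]])
cands = np.array([[0, 0], [5, 5], [9, 9]])
picked = greedy_placement(events, cands, radius=0.5, k=2)
```

With two placements allowed, the two densest event clusters are chosen first, leaving the isolated event unobserved.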
https://arxiv.org/abs/2404.14388
Hyperspectral imaging (HSI) is a key technology for earth observation, surveillance, medical imaging and diagnostics, astronomy and space exploration. The conventional technology for HSI in remote sensing applications is based on the push-broom scanning approach, in which the camera records the spectral image of one stripe of the scene at a time, and the full image is generated by aggregating measurements over time. In real-world airborne and spaceborne HSI instruments, empty stripes can appear at certain locations, because platforms do not always maintain a constant programmed attitude or have access to accurate digital elevation maps (DEMs), and the travelling track is not necessarily aligned with the hyperspectral cameras at all times. This makes the enhancement of the acquired HS images from incomplete or corrupted observations an essential task. Here we introduce a novel HSI inpainting algorithm, called Hyperspectral Equivariant Imaging (Hyper-EI). Hyper-EI is a self-supervised learning-based method that does not require training on extensive datasets or access to a pre-trained model. Experimental results show that the proposed method achieves state-of-the-art inpainting performance compared to existing methods.
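Equivariant-imaging methods typically combine a measurement-consistency term on the observed pixels with an equivariance term under a group action such as spatial shifts. Below is a toy numpy sketch of those two loss terms; the exact Hyper-EI objective may differ:

```python
import numpy as np

def ei_losses(x_hat, y, mask, reconstruct, shift=1):
    """Two loss terms in the equivariant-imaging style (illustrative sketch):
    (1) measurement consistency: the reconstruction must match the
        observed (masked) measurements;
    (2) equivariance: reconstructing a masked, shifted image should
        recover the shifted image itself."""
    mc = np.mean((mask * x_hat - y) ** 2)                    # data fidelity
    shifted = np.roll(x_hat, shift, axis=1)                  # group action T
    eq = np.mean((reconstruct(mask * shifted) - shifted) ** 2)
    return mc, eq

x_hat = np.array([[1.0, 2.0], [3.0, 4.0]])   # current reconstruction
mask = np.array([[1.0, 0.0], [1.0, 1.0]])    # one missing pixel (stripe)
y = mask * x_hat                              # consistent observations
mc, eq = ei_losses(x_hat, y, mask, reconstruct=lambda z: z)
```

With an identity "reconstructor", the measurement term vanishes but the equivariance term penalises the masked-out pixel of the shifted image, which is what drives inpainting during training.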
https://arxiv.org/abs/2404.13159
Deploying mobile robots in construction sites to collaborate with workers or perform automated tasks such as surveillance and inspections carries the potential to greatly increase productivity, reduce human error, and save costs. However, ensuring human safety is a major concern, and the rough and dynamic construction environments pose multiple challenges for robot deployment. In this paper, we present the insights we obtained from our collaborations with construction companies in Canada and discuss our experiences deploying a semi-autonomous mobile robot in real construction scenarios.
https://arxiv.org/abs/2404.13143
Underwater images taken from autonomous underwater vehicles (AUVs) often suffer from low light, high turbidity, poor contrast, motion blur and excessive light scattering, and hence require image enhancement techniques for object recognition. Machine learning methods are increasingly used for object recognition under such adverse conditions. Such enhanced object recognition on AUV imagery has potential applications in underwater pipeline and optical fibre surveillance, ocean bed resource extraction, ocean floor mapping, underwater species exploration, and more. While classical machine learning methods are very efficient in terms of accuracy, they require large datasets and long computation times for image classification. In the current work, we use quantum-classical hybrid machine learning methods for real-time underwater object recognition on board an AUV for the first time. We use real-time motion-blurred and low-light images taken from the on-board camera of an AUV built in-house and apply existing hybrid machine learning methods for object recognition. Our hybrid methods encode and flatten classical images using quantum circuits and send them to classical neural networks for image classification. The results of the hybrid methods, obtained using PennyLane-based quantum simulators on a GPU and using pre-trained models on an on-board NVIDIA GPU chipset, are compared with results from the corresponding classical machine learning methods. We observe that the hybrid quantum machine learning methods achieve an efficiency greater than 65%, reduce run-time by one-third, and require 50% smaller datasets for training compared to classical machine learning methods. We hope that our work opens up further possibilities in quantum-enhanced real-time computer vision in autonomous vehicles.
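The "quantum encoding and flattening of classical images" step can be illustrated with amplitude encoding, where a flattened image becomes the normalised amplitude vector of an n-qubit state. A numpy-only sketch of the encoding idea (the paper's actual circuits are built with PennyLane, which is not assumed here):

```python
import numpy as np

def amplitude_encode(image: np.ndarray) -> np.ndarray:
    """Flatten an image and normalise it into a valid quantum state vector
    (amplitude encoding). Padding to the next power of two yields the
    2**n amplitudes of an n-qubit register."""
    flat = image.astype(float).ravel()
    n_qubits = int(np.ceil(np.log2(len(flat))))
    padded = np.zeros(2 ** n_qubits)
    padded[: len(flat)] = flat
    norm = np.linalg.norm(padded)
    return padded / norm if norm > 0 else padded

state = amplitude_encode(np.arange(6).reshape(2, 3))  # 6 pixels -> 3 qubits
```

Six pixels need three qubits (eight amplitudes, two zero-padded), and the resulting vector has unit norm as a quantum state requires.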
https://arxiv.org/abs/2404.13130
Current methods for 3D reconstruction and environmental mapping frequently face challenges in achieving high precision, highlighting the need for practical and effective solutions. In response, our study introduces FlyNeRF, a system integrating Neural Radiance Fields (NeRF) with drone-based data acquisition for high-quality 3D reconstruction. An unmanned aerial vehicle (UAV) captures images and their corresponding spatial coordinates, and the obtained data are subsequently used for an initial NeRF-based 3D reconstruction of the environment. The render quality of this reconstruction is then evaluated by an image evaluation neural network developed within the scope of our system. Based on the results of the image evaluation module, an autonomous algorithm determines positions for additional image capture, thereby improving the reconstruction quality. The neural network introduced for render quality assessment demonstrates an accuracy of 97%. Furthermore, our adaptive methodology enhances the overall reconstruction quality, yielding an average improvement of 2.5 dB in Peak Signal-to-Noise Ratio (PSNR) for the 10% quantile. FlyNeRF demonstrates promising results, offering advancements in fields such as environmental monitoring, surveillance, and digital twins, where high-fidelity 3D reconstructions are crucial.
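The reported 2.5 dB gain refers to the standard PSNR metric, which can be computed as follows (a generic implementation, not code from this work; images are assumed to be float arrays in [0, peak]):

```python
import numpy as np

def psnr(reference: np.ndarray, render: np.ndarray, peak: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between a reference image and a
    rendered image: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((reference - render) ** 2)
    if mse == 0:
        return float("inf")                 # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((4, 4))
noisy = ref + 0.1        # uniform error of 0.1 -> MSE = 0.01
value = psnr(ref, noisy)
```

A uniform pixel error of 0.1 on unit-range images gives 20 dB; each halving of the RMS error adds about 6 dB.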
https://arxiv.org/abs/2404.12970
The Segment Anything Model (SAM) is a foundational deep neural network model for instance segmentation that has gained significant popularity for its zero-shot segmentation ability. SAM generates masks from various input prompts such as text, bounding boxes, points, or masks, introducing a novel methodology to overcome the constraints posed by dataset-specific scarcity. While SAM is trained on an extensive dataset comprising ~11M images, it consists mostly of natural photographic images, with only very limited images from other modalities. Although the rapid progress in visual infrared surveillance and X-ray security screening imaging technologies, driven forward by advances in deep learning, has significantly enhanced the ability to detect, classify and segment objects with high accuracy, it is not evident whether SAM's zero-shot capabilities transfer to such modalities. This work assesses SAM's capabilities in segmenting objects of interest in the X-ray/infrared modalities. Our approach reuses the pre-trained SAM with three different prompts: bounding box, centroid and random points. We present quantitative/qualitative results to showcase the performance on selected datasets. Our results show that SAM can segment objects in the X-ray modality when given a box prompt, but its performance varies for point prompts. Specifically, SAM performs poorly in segmenting slender objects and organic materials, such as plastic bottles. We find that infrared objects are also challenging to segment with point prompts given the low-contrast nature of this modality. This study shows that while SAM demonstrates outstanding zero-shot capabilities with box prompts, its performance ranges from moderate to poor for point prompts, indicating that special consideration of SAM's cross-modal generalisation is needed before applying it to X-ray/infrared imagery.
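Segmentation quality under different prompts is conventionally scored with mask intersection-over-union. A minimal implementation of the metric (generic, not code from this work):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary masks, the usual way
    prompt-driven segmentations (box vs. point prompts) are compared
    against ground truth."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 1.0

gt = np.zeros((4, 4), dtype=bool); gt[1:3, 1:3] = True      # 4-pixel object
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:4] = True  # over-segmented
score = mask_iou(pred, gt)
```

An over-segmentation that adds two spurious pixels to a four-pixel object scores 4/6, illustrating how IoU penalises both missed and spurious pixels symmetrically.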
https://arxiv.org/abs/2404.12285
Surveillance footage represents a valuable resource and an opportunity for conducting gait analysis. However, the typical low quality and high noise levels in such footage can severely impact the accuracy of pose estimation algorithms, which are foundational for reliable gait analysis. Existing literature suggests a direct correlation between the efficacy of pose estimation and the subsequent gait analysis results. A common mitigation strategy involves fine-tuning pose estimation models on noisy data to improve robustness. However, this approach may degrade the downstream model's performance on the original high-quality data, leading to a trade-off that is undesirable in practice. We propose a processing pipeline that incorporates a task-targeted artifact correction model specifically designed to pre-process and enhance surveillance footage before pose estimation. Our artifact correction model is optimized to work alongside a state-of-the-art pose estimation network, HRNet, without requiring repeated fine-tuning of the pose estimation model. Furthermore, we propose a simple and robust method for automatically obtaining low-quality videos annotated with poses, for the purpose of training the artifact correction model. We systematically evaluate our artifact correction model against a range of noisy surveillance data and demonstrate that our approach not only achieves improved pose estimation on low-quality surveillance footage, but also preserves the integrity of pose estimation on high-resolution footage. Our experiments show a clear enhancement in gait analysis performance, supporting the viability of the proposed method as a superior alternative to direct fine-tuning strategies. Our contributions pave the way for more reliable gait analysis using surveillance data in real-world applications, regardless of data quality.
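One simple way to obtain low-quality videos that inherit pose annotations automatically is to degrade clean, annotated frames synthetically: since the geometry is unchanged, the original pose labels remain valid. A numpy sketch with illustrative degradations (the abstract does not specify the paper's actual degradation pipeline):

```python
import numpy as np

def degrade_frame(frame: np.ndarray, scale: int = 2, noise_std: float = 0.05,
                  rng=None) -> np.ndarray:
    """Synthesise a low-quality frame from a clean one: crude
    downsample-upsample blur plus additive Gaussian noise. Because pixel
    geometry is preserved, the clean frame's pose annotations transfer to
    the degraded frame for free. The specific degradations are illustrative."""
    rng = rng or np.random.default_rng(0)
    low = frame[::scale, ::scale]                           # downsample
    up = np.repeat(np.repeat(low, scale, axis=0), scale, axis=1)  # upsample
    noisy = up + rng.normal(0.0, noise_std, up.shape)       # sensor noise
    return np.clip(noisy, 0.0, 1.0)

clean = np.full((8, 8), 0.5)   # toy frame with pose labels defined on it
degraded = degrade_frame(clean)
```

The degraded frame keeps the original resolution and value range, so the pose keypoint coordinates need no adjustment.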
https://arxiv.org/abs/2404.12183
A recent trend in multi-object tracking is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders progress. To address this challenge, we propose a high-quality yet low-cost data generation method based on Unreal Engine 5 and construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos, detailing the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions, and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object framework called MLS-Track, in which the interaction between the model and the text is enhanced layer by layer through the introduction of a Semantic Guidance Module (SGM) and a Semantic Correlation Branch (SCB). Extensive experiments on the Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of our proposed framework, which achieves state-of-the-art performance. Code and datasets will be made available.
https://arxiv.org/abs/2404.12031
Motivated by the need to improve model performance in traffic monitoring tasks with limited labeled samples, we propose a straightforward augmentation technique tailored for object detection datasets, specifically designed for stationary-camera applications. Our approach places augmented objects in the same positions as the originals to ensure its effectiveness. By applying in-place augmentation to objects from the same camera input image, we address the challenge of overlap with the original and previously selected objects. Through extensive testing on two traffic monitoring datasets, we illustrate the efficacy of our augmentation strategy in improving model performance, particularly in scenarios with limited labeled samples and imbalanced class distributions. Notably, our method achieves performance comparable to models trained on the entire dataset while utilizing only 8.5 percent of the original data. Moreover, we report significant improvements on the FishEye8K dataset, with mAP@.5 increasing from 0.4798 to 0.5025 and mAP@.5:.95 rising from 0.29 to 0.3138. These results highlight the potential of our augmentation approach for enhancing object detection models in traffic monitoring applications.
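The in-place rule — paste objects back at their original positions, rejecting candidates that would overlap existing or previously accepted objects — can be sketched as follows; the IoU threshold is an illustrative assumption, not the paper's value:

```python
def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def select_augmentations(original_boxes, candidate_boxes, iou_thr=0.1):
    """Accept candidate objects (cropped from the same camera's images and
    pasted back at their original positions) only if they do not overlap
    the original or already-accepted objects -- a sketch of the in-place
    augmentation rule."""
    accepted = []
    for box in candidate_boxes:
        placed = original_boxes + accepted
        if all(box_iou(box, other) <= iou_thr for other in placed):
            accepted.append(box)
    return accepted

orig = [[0, 0, 10, 10]]
cands = [[20, 20, 30, 30], [21, 21, 31, 31], [5, 5, 15, 15]]
kept = select_augmentations(orig, cands)
```

Only the first candidate survives: the second overlaps it heavily, and the third overlaps the original object.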
https://arxiv.org/abs/2404.11226
Human pose estimation faces hurdles in real-world applications due to factors such as lighting changes, occlusions, and cluttered environments. We introduce a unique RGB-Thermal Nearly Paired and Annotated 2D Pose Dataset, comprising over 2,400 high-quality LWIR (thermal) images. Each image is meticulously annotated with 2D human poses, offering a valuable resource for researchers and practitioners. This dataset, captured from seven actors performing diverse everyday activities such as sitting, eating, and walking, facilitates pose estimation under occlusion and in other challenging scenarios. We benchmark state-of-the-art pose estimation methods on the dataset to showcase its potential, establishing a strong baseline for future research. Our results demonstrate the dataset's effectiveness in promoting advancements in pose estimation for various applications, including surveillance, healthcare, and sports analytics. The dataset and code are available at this https URL
https://arxiv.org/abs/2404.10212
Traffic video description and analysis have received much attention recently due to the growing demand for efficient and reliable urban surveillance systems. Most existing methods focus only on locating traffic event segments, which severely lack descriptive details about the behaviour and context of the subjects of interest in the events. In this paper, we present TrafficVLM, a novel multi-modal dense video captioning model for the vehicle ego-camera view. TrafficVLM models traffic video events at different levels of analysis, both spatially and temporally, and generates long, fine-grained descriptions of the vehicles and pedestrians at different phases of the event. We also propose a conditional component for TrafficVLM to control the generation outputs, and a multi-task fine-tuning paradigm to enhance TrafficVLM's learning capability. Experiments show that TrafficVLM performs well on both the vehicle and overhead camera views. Our solution achieved outstanding results in Track 2 of the AI City Challenge 2024, ranking third in the challenge standings. Our code is publicly available at this https URL.
https://arxiv.org/abs/2404.09275
Computer vision, particularly vehicle and pedestrian identification, is critical to the evolution of autonomous driving, artificial intelligence, and video surveillance. Current traffic monitoring systems face major difficulties in effectively recognizing small objects and pedestrians in real time, posing a serious risk to public safety and contributing to traffic inefficiency. Recognizing these difficulties, our project focuses on the creation and validation of an advanced deep-learning framework capable of processing complex visual input for precise, real-time recognition of cars and people in a variety of environmental situations. We trained and evaluated different versions of the YOLOv8 and RT-DETR models on a dataset representing complicated urban settings. The YOLOv8 Large version proved to be the most effective, especially in pedestrian recognition, with great precision and robustness. The results, which include Mean Average Precision and recall rates, demonstrate the model's ability to dramatically improve traffic monitoring and safety. This study makes an important addition to real-time, reliable detection in computer vision, establishing new benchmarks for traffic management systems.
https://arxiv.org/abs/2404.08081
Multi-robot target tracking finds extensive applications in scenarios such as environmental surveillance and wildfire management, which require robust practical deployment of multi-robot systems in uncertain and dangerous environments. Traditional approaches often focus on tracking accuracy without modeling or assumptions about the environment, neglecting potential environmental hazards that cause system failures in real-world deployments. To address this challenge, we investigate multi-robot target tracking in adversarial environments, considering sensing and communication attacks under uncertainty. We design specific strategies to avoid different danger zones and propose a multi-agent tracking framework for such perilous environments. We approximate the probabilistic constraints and formulate practical optimization strategies to address the computational challenges efficiently. We evaluate the performance of our proposed methods in simulations to demonstrate the robots' ability to adjust their risk-aware behaviors under different levels of environmental uncertainty and risk confidence. The proposed method is further validated via real-world robot experiments in which a team of drones successfully tracks dynamic ground robots while remaining aware of the sensing and/or communication danger zones.
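A common way to approximate a probabilistic keep-out constraint is a Gaussian chance constraint: a candidate position is acceptable only if the probability of the (uncertain) distance to a danger zone falling inside its radius stays below a risk budget. A stdlib-only sketch, not necessarily the paper's approximation:

```python
import math

def violation_probability(mean_dist: float, std: float, radius: float) -> float:
    """P(distance < radius) when the distance estimate is Gaussian with
    the given mean and standard deviation (Gaussian CDF via erf)."""
    z = (radius - mean_dist) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def satisfies_chance_constraint(mean_dist, std, radius, eps=0.05):
    """Accept a robot position only if the chance of entering the danger
    zone stays within the risk budget eps."""
    return violation_probability(mean_dist, std, radius) <= eps

ok = satisfies_chance_constraint(mean_dist=10.0, std=1.0, radius=5.0)
risky = satisfies_chance_constraint(mean_dist=5.2, std=1.0, radius=5.0)
```

A robot ten units from a radius-5 zone easily satisfies a 5% risk budget, while one hovering just outside the boundary does not, illustrating how uncertainty inflates the effective keep-out region.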
https://arxiv.org/abs/2404.07880
The illegal disposal of trash is a major public health and environmental concern, as disposing of trash in unplanned places poses serious health and environmental risks; trash disposal should, as far as possible, be restricted to designated locations. This research focuses on automating the penalization of litterbugs, addressing the persistent problem of littering in public places. Traditional approaches relying on manual intervention and witness reporting suffer from delays, inaccuracies, and anonymity issues. To overcome these challenges, this paper proposes a fully automated system that utilizes surveillance cameras and advanced computer vision algorithms for litter detection, object tracking, and face recognition. The system accurately identifies and tracks individuals engaged in littering, establishes their identities through face recognition, and enables efficient enforcement of anti-littering policies. By reducing reliance on manual intervention, minimizing human error, and providing prompt identification, the proposed system offers significant advantages in addressing littering incidents. The primary contribution of this research lies in the implementation of the proposed system, leveraging advanced technologies to enhance surveillance operations and automate the penalization of litterbugs.
https://arxiv.org/abs/2404.07467
Forecasting the short-term spread of an ongoing disease outbreak is a formidable challenge due to the complexity of the contributing factors, some of which can be characterized through interlinked, multi-modality variables such as epidemiological time series data, viral biology, population demographics, and the intersection of public policy and human behavior. Existing forecasting frameworks struggle with the multifaceted nature of the relevant data and with robust translation of results, which hinders their performance and the provision of actionable insights for public health decision-makers. Our work introduces PandemicLLM, a novel framework built on multi-modal Large Language Models (LLMs) that reformulates real-time forecasting of disease spread as a text reasoning problem, with the ability to incorporate real-time, complex, non-numerical information that was previously unattainable in traditional forecasting models. Through a unique AI-human cooperative prompt design and time series representation learning, this approach encodes multi-modal data for LLMs. The model is applied to the COVID-19 pandemic, trained to utilize textual public health policies, genomic surveillance, spatial, and epidemiological time series data, and subsequently tested across all 50 states of the U.S. Empirically, PandemicLLM is shown to be a high-performing pandemic forecasting framework that effectively captures the impact of emerging variants and can provide timely and accurate predictions. PandemicLLM opens avenues for incorporating various pandemic-related data in heterogeneous formats and exhibits performance benefits over existing models. This study illuminates the potential of adapting LLMs and representation learning to enhance pandemic forecasting, illustrating how AI innovations can strengthen pandemic responses and crisis management in the future.
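Reformulating forecasting as text reasoning amounts to serialising heterogeneous inputs — policy text, genomic surveillance shares, case time series — into one prompt. A hypothetical sketch of such a template (field names, wording, and example values are invented, not PandemicLLM's actual prompt):

```python
def build_forecast_prompt(state: str, policies: str, variant_share: dict,
                          weekly_cases: list) -> str:
    """Assemble heterogeneous pandemic inputs into a single text-reasoning
    prompt, the kind of reformulation the framework performs. Every field
    name and phrasing here is a hypothetical stand-in."""
    variants = ", ".join(f"{v}: {p:.0%}" for v, p in variant_share.items())
    series = ", ".join(str(c) for c in weekly_cases)
    return (
        f"State: {state}\n"
        f"Active public-health policies: {policies}\n"
        f"Genomic surveillance (variant share): {variants}\n"
        f"Weekly reported cases (oldest to newest): {series}\n"
        "Question: will cases rise, stay flat, or fall next week?"
    )

prompt = build_forecast_prompt(
    "Maryland", "indoor mask mandate",
    {"BA.5": 0.62, "BA.2": 0.38}, [1200, 1350, 1500],
)
```

The point of the serialisation is that non-numerical signals (policy text, variant names) sit alongside the numeric series in a form an LLM can reason over directly.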
https://arxiv.org/abs/2404.06962
This article presents a deep reinforcement learning-based approach to a persistent surveillance mission in which a single unmanned aerial vehicle with fuel or time-of-flight constraints, initially stationed at a depot, must repeatedly visit a set of targets with equal priority. Owing to these constraints, the vehicle must be regularly refueled, or its battery recharged, at the depot. The objective is to determine an optimal sequence of visits to the targets that minimizes the maximum time elapsed between successive visits to any target, while ensuring that the vehicle never runs out of fuel or charge. We present a deep reinforcement learning algorithm to solve this problem and report numerical experiments that corroborate the effectiveness of this approach in comparison with common-sense greedy heuristics.
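The "common-sense greedy heuristic" baseline can be made concrete: always head for the target whose last visit is oldest, and return to the depot whenever the remaining fuel cannot cover the next leg plus the trip home. A toy simulation with unit travel costs (a simplifying assumption; the paper's setting is more general):

```python
def greedy_patrol(n_targets: int, travel: int, capacity: int, steps: int):
    """Greedy persistent-surveillance baseline: visit the most-overdue
    target, refuelling at the depot whenever fuel cannot cover the next
    leg plus the return trip. Returns the worst revisit gap observed."""
    last_visit = [-(10 ** 9)] * n_targets     # "never visited" sentinel
    fuel, t, worst_gap = capacity, 0, 0
    for _ in range(steps):
        target = min(range(n_targets), key=lambda i: last_visit[i])
        if fuel < 2 * travel:     # cannot reach target and still get home
            t += travel           # fly back to the depot
            fuel = capacity       # refuel / recharge
        t += travel
        fuel -= travel
        if last_visit[target] >= 0:
            worst_gap = max(worst_gap, t - last_visit[target])
        last_visit[target] = t
    return worst_gap

gap = greedy_patrol(n_targets=3, travel=1, capacity=4, steps=20)
```

With three targets, unit legs, and fuel for four legs, the greedy policy settles into a cycle that revisits every target within four time units, which is the objective the learned policy tries to beat.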
https://arxiv.org/abs/2404.06423
Human action or activity recognition in videos is a fundamental task in computer vision with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction and many more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire. This work proposes a novel approach using Cross-Architecture Pseudo-Labeling with contrastive learning for semi-supervised action recognition. Our framework leverages both labeled and unlabeled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning for effective learning from both types of samples. We introduce a novel cross-architecture approach in which 3D Convolutional Neural Networks (3D CNNs) and video transformers (ViT) are utilised to capture different aspects of action representations; hence we call it ActNetFormer. The 3D CNNs excel at capturing spatial features and local dependencies in the temporal domain, while the ViT excels at capturing long-range dependencies across frames. By integrating these complementary architectures within the ActNetFormer framework, our approach can effectively capture both local and global contextual information of an action. This comprehensive representation learning enables the model to achieve better performance in semi-supervised action recognition tasks by leveraging the strengths of each architecture. Experimental results on standard action recognition datasets demonstrate that our approach outperforms existing methods, achieving state-of-the-art performance with only a fraction of the labeled data. The official website of this work is available at: this https URL.
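Cross-architecture pseudo-labeling can be sketched as keeping an unlabeled clip only when both backbones agree on the predicted class with sufficient combined confidence. The threshold and agreement rule below are simplifications of the paper's full scheme, which also involves contrastive learning:

```python
import numpy as np

def cross_architecture_pseudo_labels(probs_cnn: np.ndarray,
                                     probs_vit: np.ndarray,
                                     threshold: float = 0.8):
    """Keep an unlabeled clip only when the 3D-CNN and the video
    transformer agree on the class AND the averaged confidence clears a
    threshold -- a simplified sketch of cross-architecture pseudo-labeling."""
    avg = (probs_cnn + probs_vit) / 2.0
    labels = avg.argmax(axis=1)
    agree = probs_cnn.argmax(axis=1) == probs_vit.argmax(axis=1)
    confident = avg.max(axis=1) >= threshold
    keep = agree & confident
    return labels[keep], np.flatnonzero(keep)

cnn = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])  # per-clip class probs
vit = np.array([[0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])
labels, kept = cross_architecture_pseudo_labels(cnn, vit)
```

The middle clip is discarded because the two architectures disagree, which is exactly the noise-filtering benefit of using heterogeneous backbones.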
https://arxiv.org/abs/2404.06243
This chapter explores the role of patent protection in algorithmic surveillance and whether ordre public exceptions from patentability should apply to such patents, due to their potential to enable human rights violations. It concludes that in most cases, it is undesirable to exclude algorithmic surveillance patents from patentability, as the patent system is ill-equipped to evaluate the impacts of the exploitation of such technologies. Furthermore, the disclosure of such patents has positive externalities from the societal perspective by opening the black box of surveillance for public scrutiny.
https://arxiv.org/abs/2404.05534
We introduce Dynamic Distinction Learning (DDL), a novel video anomaly detection methodology that combines pseudo-anomalies, dynamic anomaly weighting, and a distinction loss function to improve detection accuracy. By training on pseudo-anomalies, our approach adapts to the variability of normal and anomalous behaviors without fixed anomaly thresholds. Our model showcases superior performance on the Ped2, Avenue and ShanghaiTech datasets, with individual models tailored to each scene. These achievements highlight DDL's effectiveness in advancing anomaly detection, offering a scalable and adaptable solution for video surveillance challenges.
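The distinction-loss idea — drive reconstruction error down on normal clips while pushing it up on pseudo-anomalies, penalising pseudo-anomalies more heavily the "easier" (lower-error) they currently are — can be sketched with a squared hinge; this illustrates the concept only, not DDL's exact objective:

```python
import numpy as np

def distinction_loss(err_normal: np.ndarray, err_pseudo: np.ndarray,
                     margin: float = 1.0) -> float:
    """Toy distinction-style loss: minimise reconstruction error on normal
    clips, and hinge-penalise pseudo-anomalous clips whose error has not
    yet reached the margin (small error -> large penalty, so the weight
    on each pseudo-anomaly adapts to how hard it currently is)."""
    normal_term = np.mean(err_normal ** 2)
    hinge = np.clip(margin - err_pseudo, 0.0, None)  # active below the margin
    anomaly_term = np.mean(hinge ** 2)
    return float(normal_term + anomaly_term)

loss = distinction_loss(np.array([0.1, 0.2]),   # errors on normal clips
                        np.array([0.3, 1.5]))   # errors on pseudo-anomalies
```

The pseudo-anomaly with error 1.5 already exceeds the margin and contributes nothing, while the one at 0.3 drives most of the loss — the adaptive weighting the method exploits.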
https://arxiv.org/abs/2404.04986