Smart video surveillance systems have recently become important for ensuring public safety and security, especially in smart cities. However, applying real-time artificial intelligence technologies combined with low-latency notification and alarms makes deploying these systems quite challenging. This paper presents a case study on designing and deploying smart video surveillance systems based on a real-world testbed at a community college. We primarily focus on a smart camera-based system that can identify suspicious/abnormal activities and alert stakeholders and residents immediately. The paper highlights and addresses different algorithmic and system design challenges to guarantee real-time, high-accuracy video analytics processing in the testbed. It also presents an example of cloud system infrastructure and a mobile application for real-time notification to keep students, faculty/staff, and responsible security personnel in the loop. At the same time, it covers the design decisions made to meet the community's privacy and ethical requirements, as well as the hardware configuration and setup. We evaluate the system's performance using throughput and end-to-end latency. The experimental results show that, on average, our system's end-to-end latency to notify end users upon detecting suspicious objects is 5.3, 5.78, and 11.11 seconds when running 1, 4, and 8 cameras, respectively. When detecting anomalous behaviors, the system notifies end users with an average latency of 7.3, 7.63, and 20.78 seconds, respectively. These results demonstrate that the system effectively detects abnormal behaviors and suspicious objects and notifies end users within a reasonable period. The system can run eight cameras simultaneously at 32.41 Frames Per Second (FPS).
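For a concrete sense of how such end-to-end latency and aggregate FPS figures can be instrumented, here is a minimal Python sketch; this is our illustration, not the authors' code, and `analyze_frame` and the frame counts are stand-ins for the paper's detection pipeline:

```python
import time
from statistics import mean

# Hypothetical stand-ins for the pipeline stages; the real system runs
# object detection / anomaly recognition and pushes a mobile notification.
def analyze_frame(frame_idx):
    time.sleep(0.01)              # placeholder for model inference time
    return frame_idx % 50 == 0    # pretend every 50th frame is suspicious

def notify(captured_at):
    return time.monotonic() - captured_at  # end-to-end latency in seconds

def run_cameras(num_cameras=4, frames_per_camera=200):
    latencies, total_frames = [], 0
    start = time.monotonic()
    for frame_idx in range(frames_per_camera):
        for cam in range(num_cameras):
            captured_at = time.monotonic()
            if analyze_frame(frame_idx):
                latencies.append(notify(captured_at))
            total_frames += 1
    elapsed = time.monotonic() - start
    print(f"{num_cameras} cameras: {total_frames / elapsed:.2f} FPS aggregate, "
          f"mean notify latency {mean(latencies):.4f}s over {len(latencies)} events")

run_cameras()
```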
https://arxiv.org/abs/2303.12934
Video-based person re-identification (video re-ID) has lately attracted growing attention due to its broad practical applications in various areas, such as surveillance, smart cities, and public safety. Nevertheless, video re-ID remains quite difficult and an ongoing research problem due to numerous challenges such as viewpoint changes, occlusion, pose variation, and uncertain video sequences. In the last couple of years, deep learning on video re-ID has continuously achieved surprising results on public datasets, with various approaches being developed to handle diverse problems in video re-ID. Compared to image-based re-ID, video re-ID is much more challenging and complex. To encourage future research, this paper presents the first comprehensive review of up-to-date advancements in deep learning approaches for video re-ID. It broadly covers three important aspects: a brief overview of video re-ID methods and their limitations, major milestones with their technical challenges, and architectural design. It offers comparative performance analysis on the various available datasets, valuable guidance for improving video re-ID, and exciting research directions.
https://arxiv.org/abs/2303.11332
Unobtrusive monitoring of distances between people indoors is a useful tool in the fight against pandemics, and surveillance cameras are a natural resource for accomplishing it. Unlike previous distance estimation methods, we use a single, overhead fisheye camera with wide area coverage and propose two approaches. One method leverages a geometric model of the fisheye lens, whereas the other uses a neural network to predict the 3D-world distance from people's locations in a fisheye image. To evaluate our algorithms, we collected a first-of-its-kind dataset using a single fisheye camera that comprises a wide range of distances between people (1-58 ft) and will be made publicly available. The algorithms achieve 1-2 ft distance error and over 95% accuracy in detecting social-distance violations.
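As a rough illustration of the geometric approach, the sketch below maps two detected foot locations in a fisheye image to floor coordinates under an equidistant lens model (r = f·θ) for an overhead, downward-facing camera, then computes their separation; the intrinsics, camera height, and pixel coordinates are invented for the example, not taken from the paper:

```python
import math

def ground_position(u, v, cx, cy, f, cam_height_ft):
    """Map a pixel (u, v) to floor coordinates under an equidistant fisheye
    model (r = f * theta). All intrinsics here are illustrative assumptions."""
    dx, dy = u - cx, v - cy
    r = math.hypot(dx, dy)                 # radial distance from principal point
    theta = r / f                          # equidistant projection: angle off nadir
    d = cam_height_ft * math.tan(theta)    # horizontal distance on the floor
    phi = math.atan2(dy, dx)               # azimuth in the image plane
    return d * math.cos(phi), d * math.sin(phi)

def person_distance(p1, p2, cx=640, cy=640, f=400, cam_height_ft=9.0):
    x1, y1 = ground_position(*p1, cx, cy, f, cam_height_ft)
    x2, y2 = ground_position(*p2, cx, cy, f, cam_height_ft)
    return math.hypot(x1 - x2, y1 - y2)

# Two detected foot locations (pixels) in a 1280x1280 fisheye frame.
print(f"{person_distance((500, 700), (900, 520)):.1f} ft apart")
```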
https://arxiv.org/abs/2303.11520
AI-induced societal harms mirror existing problems in domains where AI replaces or complements traditional methodologies. However, trustworthy AI discourses postulate the homogeneity of AI, aim to derive common causes for the harms it generates, and demand uniform human interventions. Such AI monism has spurred legislation for omnibus AI laws requiring any high-risk AI system to comply with a full, uniform package of rules on fairness, transparency, accountability, human oversight, accuracy, robustness, and security, as demonstrated by the EU AI Regulation and the U.S. draft Algorithmic Accountability Act. However, it is irrational to require high-risk or critical AIs to comply with all safety, fairness, accountability, and privacy regulations when it is possible to separate the AIs that entail safety risks, biases, infringements, and privacy problems. Legislators should instead gradually adapt existing regulations by categorizing AI systems according to the types of societal harms they induce. Accordingly, this paper proposes the following categorizations, subject to ongoing empirical reassessment. First, regarding intelligent agents, safety regulations must be adapted to address the incremental accident risks arising from autonomous behavior. Second, regarding discriminative models, the law must focus on mitigating allocative harms and disclosing the marginal effects of immutable features. Third, for generative models, the law should optimize developer liability for data mining and content generation, balancing the potential social harms of infringing content against the negative impact of excessive filtering, and identify cases in which an AI's non-human identity should be disclosed. Lastly, for cognitive models, data protection law should be adapted to effectively address privacy, surveillance, and security problems and to facilitate governance built on public-private partnerships.
https://arxiv.org/abs/2303.11196
Gun violence is a critical security problem, and it is imperative for the computer vision community to develop effective gun detection algorithms for real-world scenarios, particularly in Closed Circuit Television (CCTV) surveillance data. Despite significant progress in visual object detection, detecting guns in real-world CCTV images remains a challenging and under-explored task. Firearms, especially handguns, are typically very small in size, non-salient in appearance, and often severely occluded or indistinguishable from other small objects. Additionally, the lack of principled benchmarks and the difficulty of collecting relevant datasets further hinder algorithmic development. In this paper, we present a meticulously crafted and annotated benchmark, called CCTV-Gun, which addresses the challenges of detecting handguns in real-world CCTV images. Our contribution is three-fold. Firstly, we carefully select and analyze real-world CCTV images from three datasets, manually annotate handguns and their holders, and assign each image relevant challenge factors such as blur and occlusion. Secondly, we propose a new cross-dataset evaluation protocol in addition to the standard intra-dataset protocol, which is vital for gun detection in practical settings. Finally, we comprehensively evaluate both classical and state-of-the-art object detection algorithms, providing an in-depth analysis of their generalization abilities. The benchmark will facilitate further research and development on this topic and ultimately enhance security. Code, annotations, and trained models are available at this https URL.
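The cross-dataset protocol amounts to training a detector on one source dataset and testing on another, alongside the usual intra-dataset pairing. A hedged sketch of the evaluation loop follows; the functions are placeholders and the dataset names are our assumption, not confirmed from the paper:

```python
from itertools import product

# Placeholder functions; CCTV-Gun's actual tooling is at the paper's code
# release. The point is the protocol: intra-dataset trains and tests on the
# same source, cross-dataset pairs different sources.
def train_detector(train_set):
    return f"model<{train_set}>"

def evaluate(model, test_set):
    return 0.0  # would return mAP for handgun detection

datasets = ["MGD", "USRT", "UCF"]  # three source datasets (assumed names)

for train, test in product(datasets, repeat=2):
    protocol = "intra" if train == test else "cross"
    model = train_detector(train)
    print(f"[{protocol}] train={train} test={test} mAP={evaluate(model, test):.3f}")
```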
https://arxiv.org/abs/2303.10703
Blind image quality assessment (BIQA) aims at automatically and accurately predicting objective quality scores for visual signals, and has been widely used to monitor product and service quality in low-light applications covering smartphone photography, video surveillance, autonomous driving, etc. Recent developments in this field are dominated by unimodal solutions that are inconsistent with human subjective rating patterns, in which visual perception is simultaneously shaped by multiple sources of sensory information (e.g., sight and hearing). In this article, we present a unique blind multimodal quality assessment (BMQA) of low-light images, from subjective evaluation to objective score. To investigate the multimodal mechanism, we first establish a multimodal low-light image quality (MLIQ) database with authentic low-light distortions, containing image and audio modality pairs. Further, we specially design the key modules of BMQA, considering multimodal quality representation, latent feature alignment and fusion, and hybrid self-supervised and supervised learning. Extensive experiments show that our BMQA yields state-of-the-art accuracy on the proposed MLIQ benchmark database. In particular, we also build an independent single-image-modality Dark-4K database, which is used to verify BMQA's applicability and generalization performance in mainstream unimodal applications. Qualitative and quantitative results on Dark-4K show that BMQA achieves superior performance to existing BIQA approaches as long as a pre-trained quality semantic description model is provided. The proposed framework and both databases, as well as the collected BIQA methods and evaluation metrics, are made publicly available.
https://arxiv.org/abs/2303.10369
Person re-ID matches persons across multiple non-overlapping cameras. Despite the increasing deployment of airborne platforms in surveillance, existing person re-ID benchmarks focus on ground-ground matching, with very limited effort on aerial-aerial matching. We propose a new benchmark dataset, AG-ReID, which performs person re-ID matching in a new setting: across aerial and ground cameras. Our dataset contains 21,983 images of 388 identities, with 15 soft attributes for each identity. The data was collected by a UAV flying at altitudes between 15 and 45 meters and by a ground-based CCTV camera on a university campus. Our dataset presents a novel elevated-viewpoint challenge for person re-ID due to the significant difference in person appearance across these cameras. We propose an explainable algorithm that guides the person re-ID model's training with soft attributes to address this challenge. Experiments demonstrate the efficacy of our method on the aerial-ground person re-ID task. The dataset will be published and the baseline code will be open-sourced to facilitate research in this area.
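One plausible way to let soft attributes guide re-ID training is a joint identity-plus-attribute objective over shared features, sketched below in PyTorch; this is our illustration of the general idea, not the AG-ReID authors' architecture:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: soft attributes as an auxiliary objective that
# regularizes the shared features toward viewpoint-invariant cues.
class AttributeGuidedReID(nn.Module):
    def __init__(self, feat_dim=512, num_ids=388, num_attrs=15):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.id_head = nn.Linear(feat_dim, num_ids)       # identity classification
        self.attr_head = nn.Linear(feat_dim, num_attrs)   # 15 soft attributes

    def forward(self, images):
        feats = self.backbone(images)
        return self.id_head(feats), self.attr_head(feats)

model = AttributeGuidedReID()
images = torch.randn(8, 3, 64, 32)         # toy batch of person crops
ids = torch.randint(0, 388, (8,))
attrs = torch.rand(8, 15)                  # soft attribute targets in [0, 1]
id_logits, attr_logits = model(images)

loss = nn.CrossEntropyLoss()(id_logits, ids) + \
       0.5 * nn.BCEWithLogitsLoss()(attr_logits, attrs)  # weight is an assumption
loss.backward()
print(float(loss))
```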
https://arxiv.org/abs/2303.08597
Human-centric perception includes a variety of vision tasks with widespread industrial applications, including surveillance, autonomous driving, and the metaverse. It is desirable to have a general pretrained model for versatile human-centric downstream tasks. This paper forges ahead along this path from the aspects of both benchmarks and pretraining methods. Specifically, we propose HumanBench, based on existing datasets, to comprehensively evaluate on common ground the generalization abilities of different pretraining methods on 19 datasets from 6 diverse downstream tasks, including person ReID, pose estimation, human parsing, pedestrian attribute recognition, pedestrian detection, and crowd counting. To learn both coarse-grained and fine-grained knowledge of human bodies, we further propose a Projector AssisTed Hierarchical pretraining method (PATH) to learn diverse knowledge at different granularity levels. Comprehensive evaluations on HumanBench show that our PATH achieves new state-of-the-art results on 17 downstream datasets and on-par results on the other 2 datasets. The code will be made publicly available at this https URL.
https://arxiv.org/abs/2303.05675
Current Active Speaker Detection (ASD) models achieve great results on AVA-ActiveSpeaker (AVA) using only sound and facial features. Although this approach is applicable in movie setups (AVA), it is not suited for less constrained conditions. To demonstrate this limitation, we propose a Wilder Active Speaker Detection (WASD) dataset with increased difficulty, targeting the two key components of current ASD: audio and face. Grouped into 5 categories, ranging from optimal conditions to surveillance settings, WASD contains incremental challenges for ASD with tactical impairment of audio and face data. We select state-of-the-art models and assess their performance in two groups of WASD: Easy (cooperative settings) and Hard (audio and/or face are specifically degraded). The results show that: 1) AVA-trained models maintain state-of-the-art performance in the WASD Easy group while underperforming in the Hard one, showing 2) the similarity between AVA and Easy data; and 3) training on WASD does not improve model performance to AVA levels, particularly for audio impairment and surveillance settings. This shows that AVA does not prepare models for wild ASD and that current approaches are subpar for dealing with such conditions. The proposed dataset also contains body data annotations to provide a new source for ASD, and is available at this https URL.
https://arxiv.org/abs/2303.05321
Reconstruction-based anomaly detection models achieve their purpose by suppressing the generalization ability for anomalies. However, diverse normal patterns are consequently not well reconstructed either. Although some efforts have been made to alleviate this problem by modeling sample diversity, they suffer from shortcut learning due to undesired transmission of abnormal information. In this paper, to better handle this tradeoff, we propose the Diversity-Measurable Anomaly Detection (DMAD) framework, which enhances reconstruction diversity while avoiding undesired generalization to anomalies. To this end, we design a Pyramid Deformation Module (PDM), which models diverse normal patterns and measures the severity of an anomaly by estimating multi-scale deformation fields from the reconstructed reference to the original input. Integrated with an information compression module, PDM essentially decouples deformation from prototypical embedding and makes the final anomaly score more reliable. Experimental results on both surveillance videos and industrial images demonstrate the effectiveness of our method. In addition, DMAD works equally well on contaminated data and anomaly-like normal samples.
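The scoring idea can be caricatured as combining an appearance residual with the magnitude of the estimated deformation field; the sketch below is our reading of the concept, not the official DMAD code:

```python
import numpy as np

# Conceptual sketch: a score that combines the appearance residual with how
# much a deformation field must warp the reconstruction to match the input.
def anomaly_score(x, x_recon, deform_field, alpha=1.0, beta=1.0):
    """x, x_recon: (H, W) images; deform_field: (H, W, 2) per-pixel offsets
    estimated from the reconstructed reference to the original input."""
    appearance = np.mean((x - x_recon) ** 2)
    deformation = np.mean(np.linalg.norm(deform_field, axis=-1))
    return alpha * appearance + beta * deformation

rng = np.random.default_rng(0)
x = rng.random((64, 64))
x_recon = x + 0.01 * rng.standard_normal((64, 64))   # near-perfect reconstruction
small_warp = 0.1 * rng.standard_normal((64, 64, 2))  # normal: tiny deformation
large_warp = 2.0 * rng.standard_normal((64, 64, 2))  # anomalous: severe deformation
print(anomaly_score(x, x_recon, small_warp))  # low score -> normal
print(anomaly_score(x, x_recon, large_warp))  # high score -> anomalous
```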
https://arxiv.org/abs/2303.05047
Background subtraction is a fundamental task in computer vision with numerous real-world applications, ranging from object tracking to video surveillance. Dynamic backgrounds pose a significant challenge here. Supervised deep learning-based techniques are currently considered state-of-the-art for this task. However, these methods require pixel-wise ground-truth labels, which can be time-consuming and expensive to obtain. In this work, we propose a weakly supervised framework that can perform background subtraction without requiring per-pixel ground-truth labels. Our framework is trained on a moving-object-free sequence of images and comprises two networks. The first network is an autoencoder that generates background images and prepares dynamic background images for training the second network. The dynamic background images are obtained by thresholding the background-subtracted images. The second network is a U-Net that uses the same object-free video for training and the dynamic background images as pixel-wise ground-truth labels. During the test phase, the input images are processed by the autoencoder and the U-Net, which generate background and dynamic background images, respectively. The dynamic background image helps remove dynamic motion from the background-subtracted image, enabling us to obtain a foreground image that is free of dynamic artifacts. To demonstrate the effectiveness of our method, we conducted experiments on selected categories of the CDnet 2014 dataset and the I2R dataset. Our method outperformed all top-ranked unsupervised methods. We also achieved better results than one of the two existing weakly supervised methods, and our performance was similar to the other's. Our proposed method is online, real-time, and efficient, and requires minimal frame-level annotation, making it suitable for a wide range of real-world applications.
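The pseudo-label preparation step described above reduces, conceptually, to thresholding the residual between each object-free frame and the autoencoder's background estimate. A minimal sketch, where the threshold and array shapes are illustrative assumptions:

```python
import numpy as np

# Sketch of label preparation: the autoencoder's background estimate is
# subtracted from each object-free training frame, and thresholding the
# residual yields "dynamic background" masks that later serve as pixel-wise
# pseudo-labels for the U-Net.
def dynamic_background_mask(frame, background, thresh=0.15):
    residual = np.abs(frame.astype(np.float32) - background.astype(np.float32))
    return (residual > thresh).astype(np.uint8)   # 1 = dynamic background pixel

rng = np.random.default_rng(1)
background = rng.random((120, 160)).astype(np.float32)  # autoencoder output (stand-in)
frame = background.copy()
frame[40:60, 50:90] += 0.5                              # waving-tree-like dynamics
mask = dynamic_background_mask(frame, background)
print("dynamic pixels:", int(mask.sum()))               # 800 flagged pixels
```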
https://arxiv.org/abs/2303.02857
Multi-human multi-robot teams have great potential for complex and large-scale tasks through the collaboration of humans and robots with diverse capabilities and expertise. To efficiently operate such highly heterogeneous teams and maximize team performance in a timely manner, sophisticated initial task allocation strategies that consider individual differences across team members and tasks are required. While existing works have shown promising results in reallocating tasks based on agent state and performance, their neglect of the inherent heterogeneity of the team hinders their effectiveness in realistic scenarios. In this paper, we present a novel formulation of the initial task allocation problem in multi-human multi-robot teams as a contextual multi-attribute decision-making process and propose an attention-based deep reinforcement learning approach. We introduce a cross-attribute attention module to encode the latent and complex dependencies among the multiple attributes in the state representation. We conduct a case study in a massive threat surveillance scenario and demonstrate the strengths of our model.
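A cross-attribute attention module of the kind described can be sketched as self-attention over per-attribute tokens; the following is our minimal PyTorch illustration (dimensions and attribute names are invented), not the paper's exact module:

```python
import torch
import torch.nn as nn

# Each agent/task attribute becomes a token; self-attention models the
# dependencies across attributes before the pooled embedding feeds a policy.
class CrossAttributeAttention(nn.Module):
    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                 # scalar attribute -> token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, attributes):                         # (batch, num_attributes)
        tokens = self.embed(attributes.unsqueeze(-1))      # (batch, A, d_model)
        mixed, _ = self.attn(tokens, tokens, tokens)       # attend across attributes
        return self.norm(tokens + mixed).mean(dim=1)       # pooled state embedding

state = torch.rand(16, 6)    # e.g., skill, workload, distance, risk, speed, fatigue
encoder = CrossAttributeAttention()
print(encoder(state).shape)  # torch.Size([16, 32])
```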
https://arxiv.org/abs/2303.02486
Fast-moving unmanned aerial vehicles (UAVs) are well suited for aerial surveillance but are limited by their battery capacity. To increase their endurance, UAVs can be refueled on slow-moving unmanned ground vehicles (UGVs). The cooperative routing of a UAV-UGV team to survey vast regions within their speed and fuel constraints is a computationally challenging problem, but it can be simplified with heuristics. Here we present multiple heuristics to enable feasible and sufficiently optimal solutions to the problem. Using the UAV fuel limits and the minimum set cover algorithm, the UGV refueling stops are determined. These refueling stops enable the allocation of mission points to the UAV and UGV. A standard traveling salesman formulation and a vehicle routing formulation with time windows, dropped visits, and capacity constraints are used to solve for the UGV and UAV routes, respectively. Experimental validation on a small-scale testbed shows the efficacy of the approach.
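The refueling-stop selection can be illustrated with the classic greedy approximation to minimum set cover, where each candidate stop covers the mission points reachable within the UAV's fuel radius. A toy sketch with made-up instance data:

```python
# Illustrative greedy set cover for choosing UGV refueling stops (the paper
# uses a minimum set cover formulation; the instance here is invented).
# Greedy picks, at each step, the stop covering the most uncovered points.
def greedy_refuel_stops(coverage):
    uncovered = set().union(*coverage.values())
    stops = []
    while uncovered:
        best = max(coverage, key=lambda s: len(coverage[s] & uncovered))
        stops.append(best)
        uncovered -= coverage[best]
    return stops

coverage = {
    "stop_A": {1, 2, 3},        # mission points within UAV fuel radius of stop A
    "stop_B": {3, 4, 5, 6},
    "stop_C": {6, 7},
    "stop_D": {1, 7, 8},
}
print(greedy_refuel_stops(coverage))  # ['stop_B', 'stop_D', 'stop_A']
```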
https://arxiv.org/abs/2303.02315
The use of Micro Aerial Vehicles (MAVs) for inspection and surveillance missions has proved extremely useful; however, their usability is negatively impacted by large power requirements and limited operating time. This work describes the design and development of a novel hybrid aerial-ground vehicle, enabling multi-modal mobility and long operating time, suitable for long-endurance inspection and monitoring applications. The design consists of a MAV with two tiltable axles and four independent passive wheels, allowing it to fly, approach, land, and move on flat and inclined surfaces while using the same set of actuators for all modes of locomotion. In comparison to existing multi-modal designs with passive wheels, the proposed design achieves higher ground locomotion efficiency, provides a higher payload capacity, and exhibits one of the lowest mass increases due to the ground actuation mechanism. The vehicle's performance is evaluated through a series of real experiments demonstrating its flying, ground locomotion, and wall-climbing capabilities, and the energy consumption of all modes of locomotion is evaluated.
https://arxiv.org/abs/2303.01933
It is important for infrastructure managers to maintain a high standard to ensure user satisfaction throughout the lifecycle of infrastructure. Surveillance cameras and visual inspections have enabled progress toward automating the detection of anomalous features and assessing the occurrence of deterioration. However, collecting damage data typically requires time-consuming and repeated inspections. A one-class damage detection approach has the merit that only normal images are needed to optimize the model parameters. Simultaneously, visual explanation using a heat map enables us to understand localized anomalous features. We propose a civil-engineering application that automates one-class damage detection using the fully convolutional data description (FCDD). We also visualize the explanation of the damage feature using an up-sampling-based activation map with Gaussian up-sampling from the receptive field of the fully convolutional network (FCN). We demonstrate the approach in experimental studies on concrete damage and steel corrosion, and discuss its usefulness and future work.
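The heat-map step can be approximated as upsampling the FCN's low-resolution score map and spreading each score with a Gaussian whose width reflects the receptive field. A sketch under those assumptions (sigma and sizes are illustrative, not the paper's values):

```python
import numpy as np
from scipy.ndimage import zoom, gaussian_filter

# Our approximation of FCDD-style receptive-field upsampling: a low-resolution
# one-class score map is upsampled and smoothed with a Gaussian so each score
# spreads over roughly its receptive field in the input image.
def upsample_heatmap(score_map, out_size=224, sigma=8.0):
    factor = out_size / score_map.shape[0]
    dense = zoom(score_map, factor, order=0)      # nearest-neighbor upsampling
    return gaussian_filter(dense, sigma=sigma)    # Gaussian spread ~ receptive field

rng = np.random.default_rng(2)
score_map = rng.random((28, 28)) * 0.1
score_map[20, 5] = 3.0                            # one strongly anomalous cell
heat = upsample_heatmap(score_map)
print(heat.shape, float(heat.max()))              # (224, 224), peak near the damage
```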
https://arxiv.org/abs/2303.01732
Control and planning of multi-agent systems is an active and increasingly studied topic of research, with many practical applications such as rescue missions, security, surveillance, and transportation. This thesis addresses the planning and control of multi-agent systems under temporal logic tasks. The considered systems are complex, manipulator-endowed robotic systems that can coordinate in order to execute complicated tasks, including object manipulation/transportation. Motivated by real-life scenarios, we take into account high-order dynamics subject to model uncertainties and unknown disturbances. Our approach is based on the integration of tools from the areas of multi-agent systems, intelligent control theory, cooperative object manipulation, discrete abstraction design of multi-agent-object systems, and formal verification. The first part of the thesis is devoted to the design of continuous control protocols for cooperative object manipulation/transportation by multiple robotic agents, and to the relation of rigid cooperative manipulation schemes to multi-agent formation. In the second part, we develop control schemes for the continuous coordination of multi-agent complex systems with uncertain dynamics, focusing on multi-agent navigation with collision specifications in obstacle-cluttered environments. The third part focuses on the planning and control of multi-agent and multi-agent-object systems subject to complex tasks expressed as temporal logic formulas. The fourth and final part focuses on several extensions for single-agent setups, such as motion planning under timed temporal tasks and asymptotic reference tracking for unknown systems while respecting funnel constraints.
https://arxiv.org/abs/2303.01379
The automatic identification system (AIS) and video cameras have been widely exploited for vessel traffic surveillance in inland waterways. AIS data provide vessel identity and dynamic information on vessel position and movements. In contrast, video data describe the visual appearance of moving vessels but carry no information on identity, position, movements, etc. To further improve vessel traffic surveillance, it becomes necessary to fuse the AIS and video data to simultaneously capture the visual features, identity, and dynamic information of the vessels of interest. However, traditional data fusion methods easily suffer from several potential limitations, e.g., asynchronous messages, missing data, and random outliers. In this work, we first extract AIS- and video-based vessel trajectories and then propose a deep learning-enabled asynchronous trajectory matching method (named DeepSORVF) to fuse the AIS-based vessel information with the corresponding visual targets. In addition, by combining the AIS- and video-based movement features, we also present a prior-knowledge-driven anti-occlusion method that yields accurate and robust vessel tracking results under occlusion conditions. To validate the efficacy of DeepSORVF, we have also constructed a new benchmark dataset (termed FVessel) for vessel detection, tracking, and data fusion. It consists of many videos and the corresponding AIS data collected under various weather conditions and at various locations. The experimental results demonstrate that our method guarantees highly reliable data fusion and anti-occlusion vessel tracking.
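At its simplest, asynchronous trajectory matching can be illustrated by interpolating sparse AIS fixes onto the video timestamps and assigning each visual track to the AIS track with the smallest mean deviation; the sketch below is a toy illustration of that idea, not DeepSORVF itself:

```python
import numpy as np

# Toy matcher: sparse AIS fixes are interpolated onto the video timestamps,
# tolerating the two streams' different clocks and rates, and the visual
# track is assigned to the AIS track with the smallest mean deviation.
def match_tracks(video_t, video_xy, ais_tracks):
    best, best_cost = None, float("inf")
    for mmsi, (ais_t, ais_xy) in ais_tracks.items():
        interp = np.stack([np.interp(video_t, ais_t, ais_xy[:, k])
                           for k in (0, 1)], axis=1)
        cost = np.mean(np.linalg.norm(interp - video_xy, axis=1))
        if cost < best_cost:
            best, best_cost = mmsi, cost
    return best, best_cost

video_t = np.arange(0, 10, 0.5)                    # 2 Hz visual track
video_xy = np.stack([video_t * 1.0, video_t * 0.5], axis=1)
ais_tracks = {                                     # sparse AIS fixes (invented)
    "vessel_111": (np.array([0.0, 5.0, 10.0]), np.array([[0, 0], [5, 2.5], [10, 5]])),
    "vessel_222": (np.array([0.0, 5.0, 10.0]), np.array([[30, 0], [35, 3], [40, 6]])),
}
print(match_tracks(video_t, video_xy, ais_tracks))  # ('vessel_111', ~0.0)
```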
https://arxiv.org/abs/2302.11283
Real-time video surveillance through CCTV camera systems has become essential for ensuring public safety, which is a priority today. Although CCTV cameras help greatly in increasing security, these systems require constant human interaction and monitoring. To address this issue, intelligent surveillance systems can be built using deep learning video classification techniques that automate surveillance to detect violence as it happens. In this research, we explore such deep learning video classification techniques. Traditional image classification techniques fall short when it comes to classifying videos, as they attempt to classify each frame separately, causing the predictions to flicker. Therefore, many researchers have proposed video classification techniques that consider spatiotemporal features during classification. However, deploying these deep learning models with inputs such as skeleton points obtained through pose estimation and optical flow obtained through depth sensors is not always practical in an IoT environment. Although these techniques ensure higher accuracy, they are computationally heavier. Keeping these constraints in mind, we experimented with various video classification and action recognition techniques, namely ConvLSTM, LRCN (with both custom CNN layers and VGG-16 as the feature extractor), CNN-Transformer, and C3D. We achieved test accuracies of 80% with ConvLSTM, 83.33% with CNN-BiLSTM, 70% with VGG16-BiLSTM, 76.76% with CNN-Transformer, and 80% with C3D.
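For reference, a minimal ConvLSTM video classifier in Keras in the spirit of these experiments; the input shape, filter count, and hyperparameters are illustrative assumptions, not the authors' setup:

```python
import tensorflow as tf

# A minimal ConvLSTM violence/non-violence classifier over short clips.
model = tf.keras.Sequential([
    tf.keras.layers.ConvLSTM2D(16, kernel_size=3, activation="tanh",
                               input_shape=(16, 64, 64, 3)),  # 16 frames, 64x64 RGB
    tf.keras.layers.GlobalAveragePooling2D(),                 # pool spatial map
    tf.keras.layers.Dense(1, activation="sigmoid"),           # binary prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

Classifying whole clips this way, rather than individual frames, is what suppresses the frame-by-frame prediction flicker mentioned above.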
https://arxiv.org/abs/2302.11027
Interest in automatic people re-identification systems has grown significantly in recent years, mainly for developing surveillance and smart-shop software. Due to the variability in person posture, different lighting conditions, and occluded scenarios, together with the poor quality of the images obtained by different cameras, it is currently an unsolved problem. In machine learning-based computer vision applications with reduced data sets, one way to improve the performance of a re-identification system is to augment the set of images or videos available for training the neural models. Currently, one of the most robust ways to generate synthetic information for data augmentation, whether video, images, or text, is generative adversarial networks. This article reviews the most relevant recent approaches to improving the performance of person re-identification models through data augmentation using generative adversarial networks. We focus on three categories of data augmentation approaches: style transfer, pose transfer, and random generation.
https://arxiv.org/abs/2302.09119
The advent of Edge Computing (EC) has led to a huge ecosystem in which numerous nodes can interact with data collection devices located close to end users. Human detection and tracking can be realized at edge nodes that perform the surveillance of an area under consideration with the assistance of a set of sensors (e.g., cameras). Our goal is to incorporate the discussed functionalities into embedded devices present at the edge, keeping their size limited while increasing their processing capabilities. In this paper, we propose two models for human detection, accompanied by algorithms for tracing the corresponding trajectories. We describe the proposed models and extend them to meet the challenges of the problem. Our evaluation aims to identify the models' accuracy while presenting the requirements for executing them on embedded devices.
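A lightweight centroid tracker of the kind that pairs with a compact person detector on embedded edge nodes might look like the following sketch; the threshold and the greedy association rule are our assumptions, not the paper's models:

```python
import math

# Greedy nearest-centroid tracker: each new detection is associated with the
# closest unassigned existing track within max_dist, else starts a new track.
class CentroidTracker:
    def __init__(self, max_dist=50.0):
        self.next_id, self.tracks, self.max_dist = 0, {}, max_dist

    def update(self, detections):            # detections: list of (x, y) centroids
        assigned = {}
        for cx, cy in detections:
            candidates = [(math.hypot(cx - tx, cy - ty), tid)
                          for tid, (tx, ty) in self.tracks.items()
                          if tid not in assigned.values()]
            dist, tid = min(candidates, default=(float("inf"), None))
            if tid is None or dist > self.max_dist:
                tid = self.next_id           # no match close enough: new track
                self.next_id += 1
            assigned[(cx, cy)] = tid
            self.tracks[tid] = (cx, cy)
        return assigned

tracker = CentroidTracker()
print(tracker.update([(100, 100), (300, 200)]))  # two new tracks: ids 0, 1
print(tracker.update([(105, 102), (310, 205)]))  # same people, ids persist
```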
https://arxiv.org/abs/2303.11170