Most studies in swarm robotics treat the swarm as an isolated system of interest. We argue that this prevailing view of swarms as self-sufficient, independent systems limits the scope of potential applications for swarm robotics. A robot swarm could act as a support in a heterogeneous system comprising other robots and/or human operators, in particular by quickly providing access to a large amount of data acquired in large unknown environments. Tasks such as target identification and tracking, scouting, or monitoring/surveillance could benefit from this approach.
https://arxiv.org/abs/2405.04079
Automatic personality trait assessment is essential for high-quality human-machine interactions. Systems capable of human behavior analysis could be used for self-driving cars, medical research, and surveillance, among many other applications. We present a multimodal deep neural network with a Siamese extension for apparent personality trait prediction, trained on short video recordings and exploiting modality-invariant embeddings. Acoustic, visual, and textual information are utilized to reach a high-performance solution for this task. Because the target distribution of the analyzed dataset is highly concentrated, even changes in the third decimal place are meaningful. Our proposed method addresses the challenge of under-represented extreme values, achieves an average MAE improvement of 0.0033, and shows a clear advantage over the baseline multimodal DNN without the introduced module.
https://arxiv.org/abs/2405.03846
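To make the Siamese idea above concrete, here is a minimal sketch (not the authors' architecture; the dimensions, fusion rule, and loss weighting are all hypothetical) of a two-branch network whose modality embeddings are pulled together by a Siamese cosine term while a shared head regresses the Big Five trait scores:

```python
import torch
import torch.nn as nn

class SiameseMultimodal(nn.Module):
    """Toy two-branch network: each modality is mapped into a shared
    embedding space; a shared head regresses the five trait scores."""
    def __init__(self, audio_dim=128, visual_dim=512, embed_dim=64):
        super().__init__()
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU())
        self.visual_proj = nn.Sequential(nn.Linear(visual_dim, embed_dim), nn.ReLU())
        self.head = nn.Linear(embed_dim, 5)  # Big Five trait scores

    def forward(self, audio_feat, visual_feat):
        za = self.audio_proj(audio_feat)
        zv = self.visual_proj(visual_feat)
        # Average the modality embeddings for the final prediction.
        traits = self.head((za + zv) / 2)
        return za, zv, traits

model = SiameseMultimodal()
audio, visual = torch.randn(8, 128), torch.randn(8, 512)
target = torch.rand(8, 5)
za, zv, pred = model(audio, visual)
# Siamese term: pull embeddings of the same clip together across modalities.
siamese_loss = nn.functional.cosine_embedding_loss(za, zv, torch.ones(8))
loss = nn.functional.l1_loss(pred, target) + 0.1 * siamese_loss
loss.backward()
```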
We developed a deep learning classifier of rectal cancer response (tumor vs. no tumor) to total neoadjuvant treatment (TNT) from endoscopic images acquired before, during, and following TNT. We further evaluated the network's ability, in a near out-of-distribution (OOD) setting, to identify local regrowth (LR) from follow-up endoscopy images acquired several months to years after completing TNT. We addressed endoscopic image variability with optimal mass transport-based image harmonization. We evaluated multiple training regularization schemes to study the ResNet-50 network's in-distribution and near-OOD generalization ability. Test-time augmentation yielded the largest accuracy improvement; image harmonization yielded a slight accuracy improvement for the near-OOD cases. Our results suggest that off-the-shelf deep learning classifiers can detect rectal cancer from endoscopic images at various stages of therapy for surveillance.
https://arxiv.org/abs/2405.03762
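Test-time augmentation, which gave the largest gain above, is simple to reproduce in spirit. A minimal sketch (assuming a generic PyTorch classifier and using only a horizontal flip as the augmentation) looks like this:

```python
import torch

def tta_predict(model, image):
    """Average class probabilities over simple augmentations
    (identity and horizontal flip) at inference time."""
    model.eval()
    views = [image, torch.flip(image, dims=[-1])]  # flip the width axis
    with torch.no_grad():
        probs = [torch.softmax(model(v), dim=1) for v in views]
    # Averaging softmax outputs over views tends to smooth out
    # view-dependent errors at no training cost.
    return torch.stack(probs).mean(dim=0)
```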
Due to the ever-increasing availability of video surveillance cameras and the growing need for crime prevention, the violence detection task is attracting greater attention from the research community. Compared with other action recognition tasks, violence detection in surveillance videos presents additional issues, such as the wide variety of real fight scenes. Unfortunately, available datasets are very small compared with other action recognition datasets. Moreover, in surveillance applications, the people in the scenes differ for each video and the background of the footage differs for each camera. Violent actions in real-life surveillance videos must also be detected quickly to prevent unwanted consequences, so models benefit substantially from reduced memory usage and computational costs. These problems make classical action recognition methods difficult to adopt. To tackle all these issues, we introduce JOSENet, a novel self-supervised framework that provides outstanding performance for violence detection in surveillance videos. The proposed model receives two spatiotemporal video streams, i.e., RGB frames and optical flows, and involves a new regularized self-supervised learning approach for videos. JOSENet improves on state-of-the-art self-supervised methods while requiring only one-fourth as many frames per video segment and a reduced frame rate. The source code and the instructions to reproduce our experiments are available at this https URL.
https://arxiv.org/abs/2405.02961
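The two-stream design above can be sketched roughly as follows (a toy stand-in, not JOSENet itself; channel counts and clip shapes are illustrative): each branch consumes one spatiotemporal stream and the logits are fused.

```python
import torch
import torch.nn as nn

class TwoStreamNet(nn.Module):
    """Toy two-stream model: one branch for RGB clips, one for
    optical-flow clips; logits are fused by averaging."""
    def __init__(self, num_classes=2):
        super().__init__()
        # 3D convs over (channels, time, height, width); flow has 2 channels.
        self.rgb_branch = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, num_classes))
        self.flow_branch = nn.Sequential(
            nn.Conv3d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, num_classes))

    def forward(self, rgb, flow):
        return (self.rgb_branch(rgb) + self.flow_branch(flow)) / 2

net = TwoStreamNet()
rgb = torch.randn(4, 3, 8, 112, 112)   # a short 8-frame segment
flow = torch.randn(4, 2, 8, 112, 112)  # x/y flow components
logits = net(rgb, flow)
```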
Rapid advancements in deep learning are accelerating its adoption in a wide variety of applications, including safety-critical ones such as self-driving vehicles, drones, robots, and surveillance systems. These advancements include variations of sophisticated techniques that improve model performance. However, such models are not immune to adversarial manipulations, which can cause a system to misbehave while remaining unnoticed by experts. The frequency of modifications to existing deep learning models necessitates thorough analysis to determine their impact on model robustness. In this work, we present an experimental evaluation of the effects of model modifications on deep learning model robustness using adversarial attacks. Our methodology involves examining the robustness of model variations against a range of adversarial attacks. Through these experiments, we aim to shed light on the critical issue of maintaining the reliability and safety of deep learning models in safety- and security-critical applications. Our results indicate a pressing demand for in-depth assessment of the effects of model changes on model robustness.
https://arxiv.org/abs/2405.01934
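A typical building block of such an evaluation is measuring accuracy under a white-box attack. The sketch below (assuming inputs scaled to [0, 1] and a standard PyTorch classifier; the paper does not specify its attacks here) uses the Fast Gradient Sign Method, one common choice for this kind of study:

```python
import torch
import torch.nn.functional as F

def fgsm_accuracy(model, loader, epsilon=8 / 255):
    """Accuracy of `model` on examples perturbed by the Fast Gradient
    Sign Method (one-step, L-infinity bounded)."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x = x.clone().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        # Perturb each pixel one step in the direction that increases the loss.
        x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()
        correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
        model.zero_grad(set_to_none=True)
    return correct / total
```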
The quest for robust person re-identification (Re-ID) systems capable of accurately identifying subjects across diverse scenarios remains a formidable challenge in surveillance and security applications. This study presents a novel methodology that significantly enhances Re-ID by integrating Uncertainty Feature Fusion (UFFM) with Wise Distance Aggregation (WDA). Tested on the benchmark datasets Market-1501, DukeMTMC-ReID, and MSMT17, our approach demonstrates substantial improvements in Rank-1 accuracy and mean Average Precision (mAP). Specifically, UFFM capitalizes on feature synthesis from multiple images to overcome the limitations imposed by the variability of subject appearances across different views. WDA further refines the process by intelligently aggregating similarity metrics, thereby enhancing the system's ability to discern subtle but critical differences between subjects. The empirical results affirm the superiority of our method over existing approaches, achieving new performance benchmarks across all evaluated datasets. Code is available on GitHub.
https://arxiv.org/abs/2405.01101
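The abstract above does not spell out WDA, but the general shape of distance aggregation in Re-ID can be illustrated with a hypothetical weighted combination of normalized query-gallery distance matrices (the weights, normalization, and two-metric setup are all illustrative, not the paper's method):

```python
import numpy as np

def aggregate_distances(dist_mats, weights=None):
    """Combine several query-gallery distance matrices into one ranking.
    `dist_mats` is a list of (num_query, num_gallery) arrays, each from a
    different feature or metric; weights default to uniform."""
    if weights is None:
        weights = np.ones(len(dist_mats)) / len(dist_mats)
    # Min-max normalize each matrix so different metrics are comparable.
    normed = [(d - d.min()) / (d.max() - d.min() + 1e-12) for d in dist_mats]
    return sum(w * d for w, d in zip(weights, normed))

d_euclid = np.random.rand(10, 100)
d_cosine = np.random.rand(10, 100)
fused = aggregate_distances([d_euclid, d_cosine], weights=[0.6, 0.4])
ranking = fused.argsort(axis=1)  # gallery indices sorted per query
```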
Video anomaly understanding (VAU) aims to automatically comprehend unusual occurrences in videos, thereby enabling various applications such as traffic surveillance and industrial manufacturing. While existing VAU benchmarks primarily concentrate on anomaly detection and localization, our focus is on practicality, prompting us to raise the following crucial questions: "what anomaly occurred?", "why did it happen?", and "how severe is this abnormal event?". In pursuit of these answers, we present a comprehensive benchmark for Causation Understanding of Video Anomaly (CUVA). Specifically, each instance of the proposed benchmark involves three sets of human annotations to indicate the "what", "why", and "how" of an anomaly: 1) anomaly type, start and end times, and event descriptions; 2) natural language explanations for the cause of an anomaly; and 3) free text reflecting the effect of the abnormality. In addition, we introduce MMEval, a novel evaluation metric designed to better align with human preferences for CUVA, facilitating the measurement of how well existing LLMs comprehend the underlying cause and corresponding effect of video anomalies. Finally, we propose a novel prompt-based method that can serve as a baseline approach for the challenging CUVA. We conduct extensive experiments to show the superiority of our evaluation metric and the prompt-based approach. Our code and dataset are available at this https URL.
https://arxiv.org/abs/2405.00181
The Palácio do Planalto, office of the President of Brazil, was invaded by protesters on January 8, 2023. Surveillance videos taken from inside the building were subsequently released by the Brazilian Supreme Court for public scrutiny. We used segments of such footage to create the UFPR-Planalto801 dataset for people tracking and re-identification in a real-world scenario. The dataset consists of more than 500,000 images. This paper presents a tracking approach targeting this dataset. The proposed method relies on known state-of-the-art trackers combined in a multilevel hierarchy to correct ID associations over the trajectories. We evaluated our method using the IDF1, MOTA, MOTP, and HOTA metrics. The results show improvements for every tracker used in the experiments, with the IDF1 score increasing by up to 9.5%.
https://arxiv.org/abs/2404.18876
In this paper, we present a different way to use two modalities, in which a single model sees either one modality or the other. This can be useful when adapting a unimodal model to leverage more information while respecting a limited computational budget, since it yields a single model able to deal with any of the modalities. To describe this, we coined the term anymodal learning. An example is surveillance of a room: when the lights are off, an infrared modality is much more valuable, while a visible one provides more discriminative information when the lights are on. This work investigates how to efficiently leverage visible and infrared/thermal modalities for a transformer-based object detection backbone to create an anymodal architecture. Our approach adds no inference overhead at test time while exploring an effective way to exploit the two modalities during training. To accomplish this, we introduce a novel anymodal training technique, Mixed Patches (MiPa), in conjunction with a patch-wise domain-agnostic module responsible for learning the best way to find a common representation of both modalities. This approach balances the modalities, reaching results competitive with unimodal architectures on individual-modality benchmarks across three visible-infrared object detection datasets. Finally, when used as a regularization for the strongest modality, our proposed method can beat the performance of multimodal fusion methods while requiring only a single modality during inference. Notably, MiPa became the state of the art on the LLVIP visible/infrared benchmark. Code: this https URL
https://arxiv.org/abs/2404.18849
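The core of MiPa, as described above, is composing training inputs from patches of both modalities. A rough sketch of such patch mixing (patch size, mixing ratio, and the 3-channel IR assumption are all illustrative, not the paper's exact recipe) might look like:

```python
import torch

def mix_patches(rgb, ir, patch=16, p_ir=0.5):
    """Build a training image whose patches are randomly drawn from
    either the visible or the infrared view of the same scene."""
    b, c, h, w = rgb.shape
    gh, gw = h // patch, w // patch
    # One Bernoulli draw per patch decides which modality supplies it.
    use_ir = (torch.rand(b, 1, gh, gw) < p_ir).float()
    mask = use_ir.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return ir * mask + rgb * (1 - mask)

rgb = torch.randn(2, 3, 224, 224)
ir = torch.randn(2, 3, 224, 224)  # IR replicated to 3 channels here
mixed = mix_patches(rgb, ir)      # fed to the shared detection backbone
```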
Video Anomaly Detection (VAD) identifies unusual activities in video streams, a key technology with broad applications ranging from surveillance to healthcare. Tackling VAD in real-life settings poses significant challenges due to the dynamic nature of human actions, environmental variations, and domain shifts. Many research initiatives neglect these complexities, often concentrating on traditional testing methods that fail to account for performance on unseen datasets, creating a gap between theoretical models and their real-world utility. Online learning is a potential strategy to mitigate this issue by allowing models to adapt to new information continuously. This paper assesses how well current VAD algorithms, particularly pose-based ones chosen for their efficiency and privacy advantages, can adjust to real-life conditions through an online learning framework. Our proposed framework enables continuous model updates with streaming data from novel environments, thus mirroring real-world challenges and evaluating the models' ability to adapt in real time while maintaining accuracy. We investigate three state-of-the-art models in this setting, focusing on their adaptability across different domains. Our findings indicate that, even under the most challenging conditions, our online learning approach allows a model to preserve 89.39% of its original effectiveness compared to its offline-trained counterpart in a specific target domain.
https://arxiv.org/abs/2404.18747
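A minimal sketch of the streaming-update idea (assuming a generic supervised proxy objective and a PyTorch model; the actual framework and its pose-based models are more involved) could look like:

```python
import torch

def online_adapt(model, stream, lr=1e-4, steps_per_batch=1):
    """Continuously update a pre-trained model on batches arriving from a
    new deployment domain, mimicking streaming adaptation."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for x, y in stream:  # an iterator over incoming (input, target) batches
        for _ in range(steps_per_batch):
            opt.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
    return model
```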
Text-based person search (TBPS) aims to retrieve images of a specific person from a large image gallery based on a natural language description. Existing methods rely on massive annotated image-text data to achieve satisfactory performance in fully-supervised learning. This poses a significant challenge in practice: acquiring person images from surveillance videos is relatively easy, while obtaining annotated texts is not. This paper undertakes a pioneering initiative to explore TBPS under the semi-supervised setting, where only a limited number of person images are annotated with textual descriptions while the majority of images lack annotations. We present a two-stage basic solution based on generation-then-retrieval for semi-supervised TBPS. The generation stage enriches annotated data by applying an image captioning model to generate pseudo-texts for unannotated images. The retrieval stage then performs fully-supervised retrieval learning using the augmented data. Crucially, considering the noise interference of the pseudo-texts on retrieval learning, we propose a noise-robust retrieval framework that enhances the ability of the retrieval model to handle noisy data. The framework integrates two key strategies: Hybrid Patch-Channel Masking (PC-Mask) to refine the model architecture, and Noise-Guided Progressive Training (NP-Train) to enhance the training process. PC-Mask masks the input data at both the patch level and the channel level to prevent overfitting to noisy supervision. NP-Train introduces a progressive training schedule based on the noise level of the pseudo-texts to facilitate noise-robust learning. Extensive experiments on multiple TBPS benchmarks show that the proposed framework achieves promising performance under the semi-supervised setting.
https://arxiv.org/abs/2404.18106
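A rough illustration of hybrid patch- and channel-level masking (the masking probabilities and patch size are hypothetical, and the real PC-Mask operates on the model's representations rather than on raw images) might be:

```python
import torch

def pc_mask(x, patch=16, p_patch=0.1, p_channel=0.1):
    """Randomly zero out whole patches and whole channels of a batch,
    a rough stand-in for hybrid patch/channel masking."""
    b, c, h, w = x.shape
    gh, gw = h // patch, w // patch
    keep_patch = (torch.rand(b, 1, gh, gw) > p_patch).float()
    keep_patch = keep_patch.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    keep_channel = (torch.rand(b, c, 1, 1) > p_channel).float()
    return x * keep_patch * keep_channel

imgs = torch.randn(4, 3, 224, 224)
masked = pc_mask(imgs)  # the model must stay robust to the dropped content
```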
In this paper we introduce CUE-Net, a novel architecture designed for automated violence detection in video surveillance. As surveillance systems become more prevalent due to technological advances and decreasing costs, the challenge of efficiently monitoring vast amounts of video data has intensified. CUE-Net addresses this challenge by combining spatial Cropping with an enhanced version of the UniformerV2 architecture, integrating convolutional and self-attention mechanisms alongside a novel Modified Efficient Additive Attention mechanism (which reduces the quadratic time complexity of self-attention) to effectively and efficiently identify violent activities. This approach aims to overcome traditional challenges such as capturing distant or partially obscured subjects within video frames. By focusing on both local and global spatiotemporal features, CUE-Net achieves state-of-the-art performance on the RWF-2000 and RLVS datasets, surpassing existing methods.
https://arxiv.org/abs/2404.18952
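Efficient additive attention replaces pairwise token interactions with a single learned global query, bringing the cost down from quadratic to linear in the number of tokens. A simplified sketch of such a module (not CUE-Net's exact mechanism) is shown below:

```python
import torch
import torch.nn as nn

class EfficientAdditiveAttention(nn.Module):
    """Linear-complexity additive attention: tokens attend to a single
    learned global query instead of to every other token."""
    def __init__(self, dim=64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.w_g = nn.Parameter(torch.randn(dim))  # scoring vector
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):  # x: (batch, tokens, dim)
        q, k = self.to_q(x), self.to_k(x)
        # One scalar score per token -> O(N) instead of O(N^2).
        alpha = torch.softmax(q @ self.w_g * self.scale, dim=1)  # (B, N)
        global_q = (alpha.unsqueeze(-1) * q).sum(dim=1, keepdim=True)
        return self.proj(global_q * k) + q

attn = EfficientAdditiveAttention()
out = attn(torch.randn(2, 196, 64))  # 196 tokens, e.g. a 14x14 patch grid
```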
Infrared and visible image fusion (IVIF) aims to preserve thermal radiation information from infrared images while integrating texture details from visible images, enabling the capture of important features and hidden details of subjects in complex scenes and disturbed environments. IVIF therefore offers distinct advantages in practical applications such as video surveillance, night navigation, and target recognition. However, prevailing methods often struggle to simultaneously capture thermal region features and detailed information due to the disparate characteristics of infrared and visible images, so fusion outcomes frequently entail a compromise between thermal target area information and texture details. In this study, we introduce a novel heterogeneous dual-discriminator generative adversarial network (HDDGAN) to address this issue. Specifically, the generator is structured as a multi-scale skip-connected network, facilitating the extraction of essential features from the different source images. To enhance the information representation ability of the fusion result, an attention mechanism is employed to construct the information fusion layer within the generator, leveraging the disparities between the source images. Moreover, recognizing the distinct learning requirements of information in infrared and visible images, we design two discriminators with differing structures. This guides the model to learn salient information from infrared images while simultaneously capturing detailed information from visible images. Extensive experiments conducted on various public datasets demonstrate the superiority of our proposed HDDGAN over other state-of-the-art (SOTA) algorithms, highlighting its enhanced potential for practical applications.
https://arxiv.org/abs/2404.15992
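The dual-discriminator setup can be sketched as a generator step that must fool two structurally different critics, one anchored to the infrared domain and one to the visible domain (toy shapes and plain adversarial losses only; the actual HDDGAN also uses attention-based fusion and content losses):

```python
import torch
import torch.nn as nn

# Toy shapes: G fuses a 2-channel (IR + visible) stack into one image;
# each discriminator scores 1-channel images.
G = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 1, 3, padding=1))
D_ir = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2), nn.Flatten(),
                     nn.LazyLinear(1))
D_vis = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2), nn.Flatten(),
                      nn.LazyLinear(1))
bce = nn.BCEWithLogitsLoss()

ir = torch.rand(4, 1, 64, 64)
vis = torch.rand(4, 1, 64, 64)
fused = G(torch.cat([ir, vis], dim=1))
ones = torch.ones(4, 1)
# Generator step: the fused image should fool both discriminators, one
# judging it against infrared images, the other against visible ones.
g_loss = bce(D_ir(fused), ones) + bce(D_vis(fused), ones)
g_loss.backward()
```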
Spatiotemporal networks' observational capabilities are crucial for accurate data gathering and informed decisions across multiple sectors. This study focuses on the Spatiotemporal Ranged Observer-Observable Bipartite Network (STROOBnet), which links observational nodes (e.g., surveillance cameras) to events within defined geographical regions, enabling efficient monitoring. Using data from Real-Time Crime Camera (RTCC) systems and Calls for Service (CFS) in New Orleans, where the RTCC combats rising crime amidst reduced police presence, we address the network's initial observational imbalances. Aiming for uniform observational efficacy, we propose the Proximal Recurrence approach. It outperformed traditional clustering methods such as k-means and DBSCAN by considering event frequency and spatial distribution holistically, enhancing observational coverage.
https://arxiv.org/abs/2404.14388
Hyperspectral imaging (HSI) is a key technology for earth observation, surveillance, medical imaging and diagnostics, astronomy, and space exploration. The conventional technology for HSI in remote sensing applications is based on the push-broom scanning approach, in which the camera records the spectral image of one stripe of the scene at a time and the full image is generated by aggregating measurements over time. In real-world airborne and spaceborne HSI instruments, empty stripes can appear at certain locations, because platforms do not always maintain a constant programmed attitude or have access to accurate digital elevation maps (DEMs), and the travel track is not always aligned with the hyperspectral camera. This makes enhancing the acquired HS images from incomplete or corrupted observations an essential task. We introduce a novel HSI inpainting algorithm, called Hyperspectral Equivariant Imaging (Hyper-EI). Hyper-EI is a self-supervised learning-based method that requires neither training on extensive datasets nor access to a pre-trained model. Experimental results show that the proposed method achieves state-of-the-art inpainting performance compared to existing methods.
https://arxiv.org/abs/2404.13159
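Equivariant imaging trains the reconstruction network purely from corrupted measurements by enforcing measurement consistency plus invariance of the reconstruction under transformations of the scene. A loose sketch for stripe inpainting (the masking operator, the shift transform, and the tiny network are all simplified placeholders, not Hyper-EI itself) is:

```python
import torch

def ei_losses(f, y, mask):
    """Self-supervised equivariant-imaging objective for inpainting.
    `f` is the reconstruction network, `y` the observed (masked) image,
    `mask` the binary stripe mask (1 = observed); here A(x) = mask * x."""
    x1 = f(y)
    # Measurement consistency: agree with the observed pixels.
    mc = ((mask * x1 - y) ** 2).mean()
    # Equivariance: apply a random shift, re-measure, reconstruct again.
    shift = int(torch.randint(1, 8, (1,)))
    x2 = torch.roll(x1, shifts=shift, dims=-1)
    x3 = f(mask * x2)
    eq = ((x3 - x2) ** 2).mean()
    return mc + eq

net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 1, 3, padding=1))
mask = (torch.rand(2, 1, 1, 64) > 0.2).float().expand(2, 1, 64, 64)
y = torch.rand(2, 1, 64, 64) * mask  # observation with missing stripes
loss = ei_losses(net, y, mask)
loss.backward()
```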
Deploying mobile robots in construction sites to collaborate with workers or perform automated tasks such as surveillance and inspections carries the potential to greatly increase productivity, reduce human errors, and save costs. However, ensuring human safety is a major concern, and the rough and dynamic construction environments pose multiple challenges for robot deployment. In this paper, we present the insights we obtained from our collaborations with construction companies in Canada and discuss our experiences deploying a semi-autonomous mobile robot in real construction scenarios.
https://arxiv.org/abs/2404.13143
Underwater images taken from autonomous underwater vehicles (AUVs) often suffer from low light, high turbidity, poor contrast, motion blur, and excessive light scattering, and hence require image enhancement techniques for object recognition. Machine learning methods are increasingly used for object recognition under such adverse conditions. Enhanced object recognition on AUV imagery has potential applications in underwater pipeline and optical fibre surveillance, ocean bed resource extraction, ocean floor mapping, underwater species exploration, and more. While classical machine learning methods are very efficient in terms of accuracy, they require large datasets and high computational time for image classification. In the current work, we use quantum-classical hybrid machine learning methods for real-time underwater object recognition on board an AUV for the first time. We use real-time motion-blurred and low-light images taken from the on-board camera of an AUV built in-house and apply existing hybrid machine learning methods for object recognition. Our hybrid methods encode and flatten the classical images using quantum circuits and send them to classical neural networks for image classification. The results of the hybrid methods, obtained both with PennyLane-based quantum simulators on a GPU and with pre-trained models on an on-board NVIDIA GPU chipset, are compared with results from the corresponding classical machine learning methods. We observe that the hybrid quantum machine learning methods show an efficiency greater than 65%, reduce run time by one-third, and require 50% smaller dataset sizes for training the models compared to classical machine learning methods. We hope that our work opens up further possibilities in quantum-enhanced real-time computer vision in autonomous vehicles.
https://arxiv.org/abs/2404.13130
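A typical hybrid pattern of this kind encodes a classical feature vector into a parameterized quantum circuit whose expectation values feed a classical classifier. A minimal PennyLane sketch (generic, not the authors' exact circuit) is:

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def circuit(features, weights):
    # Encode a 4-dimensional classical feature vector as rotation angles.
    qml.AngleEmbedding(features, wires=range(n_qubits))
    # Trainable entangling layers act as the quantum feature map.
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=n_qubits)
weights = np.random.random(size=shape)
features = np.array([0.1, 0.5, 0.9, 0.3])
quantum_feats = circuit(features, weights)  # fed to a classical classifier
```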
Current methods for 3D reconstruction and environmental mapping frequently face challenges in achieving high precision, highlighting the need for practical and effective solutions. In response, our study introduces FlyNeRF, a system integrating Neural Radiance Fields (NeRF) with drone-based data acquisition for high-quality 3D reconstruction. An unmanned aerial vehicle (UAV) captures images and corresponding spatial coordinates, and the obtained data are used for an initial NeRF-based 3D reconstruction of the environment. The render quality of the reconstruction is then assessed by an image evaluation neural network developed within the scope of our system. Based on the image evaluation module's output, an autonomous algorithm determines positions for additional image capture, thereby improving reconstruction quality. The neural network introduced for render quality assessment demonstrates an accuracy of 97%. Furthermore, our adaptive methodology enhances the overall reconstruction quality, yielding an average improvement of 2.5 dB in Peak Signal-to-Noise Ratio (PSNR) for the 10% quantile. FlyNeRF demonstrates promising results, offering advancements in fields such as environmental monitoring, surveillance, and digital twins, where high-fidelity 3D reconstructions are crucial.
https://arxiv.org/abs/2404.12970
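For reference, PSNR is the fidelity measure quoted above; a quick sketch of computing it between a rendered view and a ground-truth photograph (both scaled to [0, 1]):

```python
import numpy as np

def psnr(render, reference, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB between two images in [0, 1]."""
    mse = np.mean((render - reference) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

ref = np.random.rand(256, 256, 3)
noisy = np.clip(ref + 0.05 * np.random.randn(256, 256, 3), 0, 1)
print(f"PSNR: {psnr(noisy, ref):.2f} dB")  # +2.5 dB = a noticeably cleaner render
```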
The Segment Anything Model (SAM) is a deep neural network foundation model designed to perform instance segmentation, which has gained significant popularity given its zero-shot segmentation ability. SAM operates by generating masks from various input prompts such as text, bounding boxes, points, or masks, introducing a novel methodology to overcome the constraints posed by dataset-specific scarcity. While SAM is trained on an extensive dataset comprising ~11M images, the dataset consists mostly of natural photographic images, with only very limited images from other modalities. While rapid progress in visual infrared surveillance and X-ray security screening imaging technologies, driven by advances in deep learning, has significantly enhanced the ability to detect, classify, and segment objects with high accuracy, it is not evident whether SAM's zero-shot capabilities transfer to such modalities. This work assesses SAM's capability to segment objects of interest in the X-ray/infrared modalities. Our approach reuses the pre-trained SAM with three different prompts: bounding box, centroid, and random points. We present quantitative and qualitative results to showcase the performance on selected datasets. Our results show that SAM can segment objects in the X-ray modality when given a box prompt, but its performance varies for point prompts. Specifically, SAM performs poorly in segmenting slender objects and organic materials, such as plastic bottles. We find that infrared objects are also challenging to segment with point prompts given the low-contrast nature of this modality. This study shows that while SAM demonstrates outstanding zero-shot capabilities with box prompts, its performance ranges from moderate to poor for point prompts, indicating that special consideration of SAM's cross-modal generalisation is needed when considering its use on X-ray/infrared imagery.
https://arxiv.org/abs/2404.12285
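Reusing the pre-trained SAM with box and point prompts follows the public segment-anything API; a sketch (the checkpoint path, coordinates, and the all-zero placeholder image are stand-ins) is:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # local weights
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # an X-ray scan as RGB
predictor.set_image(image)

# Box prompt around a suspected item: (x0, y0, x1, y1).
masks, scores, _ = predictor.predict(
    box=np.array([100, 100, 300, 300]), multimask_output=False)

# Point prompt: a single foreground click at the object centroid.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[200, 200]]),
    point_labels=np.array([1]),  # 1 = foreground, 0 = background
    multimask_output=False)
```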
Surveillance footage represents a valuable resource and an opportunity for conducting gait analysis. However, the typically low quality and high noise levels of such footage can severely impact the accuracy of pose estimation algorithms, which are foundational for reliable gait analysis. Existing literature suggests a direct correlation between the efficacy of pose estimation and the subsequent gait analysis results. A common mitigation strategy involves fine-tuning pose estimation models on noisy data to improve robustness. However, this approach may degrade the downstream model's performance on the original high-quality data, leading to a trade-off that is undesirable in practice. We propose a processing pipeline that incorporates a task-targeted artifact correction model specifically designed to pre-process and enhance surveillance footage before pose estimation. Our artifact correction model is optimized to work alongside a state-of-the-art pose estimation network, HRNet, without requiring repeated fine-tuning of the pose estimation model. Furthermore, we propose a simple and robust method for automatically obtaining low-quality videos annotated with poses, for the purpose of training the artifact correction model. We systematically evaluate the performance of our artifact correction model against a range of noisy surveillance data and demonstrate that our approach not only improves pose estimation on low-quality surveillance footage but also preserves the integrity of pose estimation on high-resolution footage. Our experiments show a clear enhancement in gait analysis performance, supporting the viability of the proposed method as a superior alternative to direct fine-tuning strategies. Our contributions pave the way for more reliable gait analysis using surveillance data in real-world applications, regardless of data quality.
https://arxiv.org/abs/2404.12183
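The pipeline ordering described above, correction first and pose estimation second, can be sketched as follows (generic modules standing in for the artifact correction model and HRNet; the heatmap-argmax decoding is one common convention, not necessarily the paper's):

```python
import torch

def corrected_pose_pipeline(correction_model, pose_model, frames):
    """Run the artifact-correction model first, then pose estimation,
    so the pose network only ever sees enhanced frames."""
    correction_model.eval()
    pose_model.eval()
    with torch.no_grad():
        enhanced = correction_model(frames)  # denoised / deblurred frames
        heatmaps = pose_model(enhanced)      # e.g. per-joint heatmaps
    # Joint locations = argmax of each heatmap channel.
    b, j, h, w = heatmaps.shape
    flat = heatmaps.view(b, j, -1).argmax(dim=-1)
    return torch.stack([flat % w, flat // w], dim=-1)  # (x, y) per joint
```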