With the emergence of new flapping-wing micro aerial vehicle (FWMAV) designs, a need for extensive and advanced mission capabilities arises. FWMAVs try to adapt and emulate the flight features of birds and flying insects. While current designs already achieve high manoeuvrability, they still almost entirely lack perching and take-off abilities. These capabilities could, for instance, enable long-term monitoring and surveillance missions, and operations in cluttered environments or in proximity to humans and animals. We present the development and testing of a framework that enables repeatable perching and take-off for small to medium-sized FWMAVs, utilising soft, non-damaging grippers. Thanks to its novel active-passive actuation system, an energy-conserving state can be achieved and indefinitely maintained while the vehicle is perched. A prototype of the proposed system weighing under 39 g was manufactured and extensively tested on a 110 g flapping-wing robot. Successful free-flight tests demonstrated the full mission cycle of landing, perching and subsequent take-off. The telemetry data recorded during the flights yields extensive insight into the system's behaviour and is a valuable step towards full automation and optimisation of the entire take-off and landing cycle.
https://arxiv.org/abs/2409.11921
Biometric recognition systems, known for their convenience, are widely adopted across various fields. However, their security faces risks that depend on the authentication algorithm and deployment environment. Current risk assessment methods face significant challenges in incorporating the crucial factor of attacker motivation, leading to incomplete evaluations. This paper presents a novel human-centered risk evaluation framework that uses conjoint analysis to quantify the impact of risk factors, such as surveillance cameras, on an attacker's motivation. Our framework calculates risk values incorporating the False Acceptance Rate (FAR) and attack probability, allowing comprehensive comparisons across use cases. A survey of 600 Japanese participants demonstrates our method's effectiveness, showing how security measures influence attacker motivation. This approach helps decision-makers customize biometric systems to enhance security while maintaining usability.
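As an illustration of the kind of risk computation the abstract describes, here is a minimal sketch combining FAR with an estimated attack probability. The multiplicative form, the deterrence factor, and the numbers are assumptions for illustration, not the paper's exact model.

```python
# A minimal sketch: expected risk of a successful spoofing attempt as
# P(attack) * P(accept | attack). The formula and numbers are assumptions,
# not the paper's exact model.

def risk_value(far: float, attack_probability: float) -> float:
    """Risk of a successful spoof: attack probability times acceptance rate."""
    return attack_probability * far

# The attack probability could itself be modulated by deterrents such as
# surveillance cameras, with weights estimated via conjoint analysis.
base_attack_probability = 0.10   # hypothetical baseline motivation
camera_deterrence_factor = 0.6   # hypothetical utility from conjoint analysis

for far in (1e-2, 1e-4, 1e-6):
    p_attack = base_attack_probability * camera_deterrence_factor
    print(f"FAR={far:.0e}  risk={risk_value(far, p_attack):.2e}")
```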
https://arxiv.org/abs/2409.11224
Weakly supervised video anomaly detection (WS-VAD) is a crucial area in computer vision for developing intelligent surveillance systems. The proposed system uses three feature streams: RGB video, optical flow, and audio signals, where each stream extracts complementary spatial and temporal features using an enhanced attention module to improve detection accuracy and robustness. In the first stream, we employ an attention-based, multi-stage feature enhancement approach to improve the spatial and temporal features of the RGB video: the first stage consists of a ViT-based CLIP module whose top-k features are concatenated in parallel with rich spatiotemporal features from I3D and Temporal Contextual Aggregation (TCA); the second stage captures temporal dependencies using the Uncertainty-Regulated Dual Memory Units (UR-DMU) model, which learns representations of normal and abnormal data simultaneously; and the third stage selects the most relevant spatiotemporal features. The second stream extracts enhanced attention-based spatiotemporal features from the optical-flow modality by integrating deep learning with the attention module. The audio stream captures auditory cues using an attention module integrated with the VGGish model, aiming to detect anomalies from sound patterns. These streams enrich the model by incorporating motion and audio signals that are often indicative of abnormal events undetectable through visual analysis alone. The concatenation-based multimodal fusion leverages the strengths of each modality, resulting in a comprehensive feature set that significantly improves anomaly detection accuracy and robustness across three datasets. Extensive experiments and high performance on the three benchmark datasets demonstrate the effectiveness of the proposed system over existing state-of-the-art systems.
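To make the first-stage fusion concrete, here is a minimal sketch of top-k feature selection concatenated in parallel with I3D-style features. All dimensions, the scoring head, and the final classifier are placeholders, not the paper's actual modules.

```python
import torch
import torch.nn as nn

# A minimal sketch of the first stream's fusion step as described above:
# top-k selected CLIP snippet features concatenated in parallel with an
# I3D-based feature vector. Dimensions and heads are illustrative assumptions.

class TopKConcatFusion(nn.Module):
    def __init__(self, clip_dim=512, i3d_dim=1024, k=3):
        super().__init__()
        self.k = k
        self.score = nn.Linear(clip_dim, 1)                # relevance per snippet
        self.head = nn.Linear(clip_dim * k + i3d_dim, 1)   # anomaly score

    def forward(self, clip_feats, i3d_feats):
        # clip_feats: (batch, snippets, clip_dim); i3d_feats: (batch, i3d_dim)
        scores = self.score(clip_feats).squeeze(-1)        # (batch, snippets)
        topk = scores.topk(self.k, dim=1).indices          # top-k snippet indices
        idx = topk.unsqueeze(-1).expand(-1, -1, clip_feats.size(-1))
        selected = clip_feats.gather(1, idx).flatten(1)    # (batch, k * clip_dim)
        fused = torch.cat([selected, i3d_feats], dim=1)
        return torch.sigmoid(self.head(fused))             # per-video anomaly score

fusion = TopKConcatFusion()
out = fusion(torch.randn(2, 32, 512), torch.randn(2, 1024))
print(out.shape)  # torch.Size([2, 1])
```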
https://arxiv.org/abs/2409.11223
3D face reconstruction (3DFR) algorithms are based on specific assumptions tailored to distinct application scenarios. These assumptions limit their use when acquisition conditions, such as the subject's distance from the camera or the camera's characteristics, differ from those expected, as typically happens in video surveillance. Additionally, 3DFR algorithms follow various strategies to reconstruct a 3D shape from 2D data, such as statistical model fitting, photometric stereo, or deep learning. In the present study, we explore the application of three 3DFR algorithms representative of the state of the art, employing each one as the template-set generator for a face verification system. The scores provided by each system are combined by score-level fusion. We show that the complementarity induced by different 3DFR algorithms improves performance when tests are conducted at never-seen-before camera distances and with never-seen-before camera characteristics (cross-distance and cross-camera settings), thus encouraging further investigation of approaches based on multiple 3DFR algorithms.
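The score-level fusion step can be sketched in a few lines. Min-max normalization and the simple sum rule below are common choices assumed for illustration; the abstract does not specify the exact combination rule.

```python
import numpy as np

# A minimal sketch of score-level fusion across the three 3DFR-based verifiers.
# Min-max normalization and the sum rule are assumptions for illustration.

def minmax_norm(scores: np.ndarray) -> np.ndarray:
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)

def fuse(score_sets):
    """Sum-rule fusion of per-system verification scores (higher = more genuine)."""
    return sum(minmax_norm(s) for s in score_sets) / len(score_sets)

rng = np.random.default_rng(0)
sys_a, sys_b, sys_c = (rng.normal(size=100) for _ in range(3))  # toy score sets
fused = fuse([sys_a, sys_b, sys_c])
print(fused.shape, fused.min(), fused.max())
```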
https://arxiv.org/abs/2409.10481
Despite the remarkable performance of deep neural networks for face detection and recognition tasks in the visible spectrum, their performance on more challenging non-visible domains is comparatively still lacking. While significant research has been done in the fields of domain adaptation and domain generalization, in this paper we tackle scenarios in which these methods have limited applicability owing to the lack of training data from target domains. We focus on the single-source (visible) and multi-target (SWIR, long-range/remote, surveillance, and body-worn) face recognition task. We show through experiments that a good template generation algorithm becomes crucial as the complexity of the target domain increases. In this context, we introduce a template generation algorithm called Norm Pooling (and a variant known as Sparse Pooling) and show that it outperforms average pooling across different domains and networks on the IARPA JANUS Benchmark Multi-domain Face (IJB-MDF) dataset.
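The abstract does not define Norm Pooling, so the sketch below is speculative: it contrasts plain average pooling with a template built as a norm-weighted average of per-image embeddings. Treat the weighting scheme as an assumption, not the paper's definition.

```python
import numpy as np

# Speculative sketch: template generation by norm-weighted pooling of face
# embeddings, versus plain average pooling. The weighting is an assumption.

def average_pooling(embeddings: np.ndarray) -> np.ndarray:
    return embeddings.mean(axis=0)

def norm_pooling(embeddings: np.ndarray) -> np.ndarray:
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)  # (n, 1)
    weights = norms / norms.sum()                              # emphasize confident embeddings
    return (weights * embeddings).sum(axis=0)

embs = np.random.default_rng(1).normal(size=(10, 512))  # 10 toy face embeddings
print(average_pooling(embs).shape, norm_pooling(embs).shape)  # (512,) (512,)
```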
https://arxiv.org/abs/2409.09832
Abnormal event detection, or anomaly detection, in surveillance videos is currently a challenge because of the diversity of possible events. Due to the lack of anomalous events at training time, anomaly detection requires the design of learning methods without supervision. In this work we propose an unsupervised approach for video anomaly detection that jointly optimizes the objectives of the deep neural network and the anomaly detection task using a hybrid architecture. Initially, a convolutional autoencoder is pre-trained in an unsupervised manner on a fusion of depth, motion, and appearance features. In the second step, we utilize the encoder part of the pre-trained autoencoder and extract the embeddings of the fused input. We then jointly train/fine-tune the encoder to map the embeddings to a hypercenter, so that embeddings of normal data fall near the hypercenter, whereas embeddings of anomalous data fall far away from it.
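A minimal sketch of the hypercenter objective, in the spirit of Deep SVDD: the encoder is fine-tuned so that embeddings of normal data lie close to a fixed center. The encoder architecture, data, and center initialization are placeholders, not the paper's exact setup.

```python
import torch

# Sketch: fine-tune an encoder to pull normal embeddings toward a center c.
# Architecture and data are stand-ins for the pre-trained autoencoder encoder
# and the fused depth/motion/appearance input.

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64 * 64, 128))
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

with torch.no_grad():  # center c: mean embedding of the (normal) training data
    c = encoder(torch.randn(256, 1, 64, 64)).mean(dim=0)

for _ in range(10):  # training loop over normal data only
    batch = torch.randn(32, 1, 64, 64)
    z = encoder(batch)
    loss = ((z - c) ** 2).sum(dim=1).mean()  # mean squared distance to c
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At test time, the anomaly score is an embedding's distance to c.
score = ((encoder(torch.randn(1, 1, 64, 64)) - c) ** 2).sum(dim=1)
print(score.item())
```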
https://arxiv.org/abs/2409.09804
In this paper, we jointly combine image classification and image denoising, aiming to enhance human perception of noisy images captured by edge devices, like low-light security cameras. In such settings, it is important to retain the ability of humans to verify the automatic classification decision, and thus to jointly denoise the image to enhance human perception. Since edge devices have little computational power, we explicitly optimize for efficiency by proposing a novel architecture that integrates the two tasks. Additionally, we adapt a Neural Architecture Search (NAS) method, originally designed to search for classifiers, so that it searches for the integrated model while optimizing for a target latency, classification accuracy, and denoising performance. The NAS architectures outperform our manually designed alternatives in both denoising and classification, offering a significant improvement to human perception. Our approach empowers users to construct architectures tailored to domains like medical imaging, surveillance systems, and industrial inspection.
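One way to picture the integrated model is a shared backbone with a denoising head and a classification head, trained with a weighted sum of losses. The architecture and the trade-off weight below are illustrative assumptions, not the paper's searched model.

```python
import torch
import torch.nn as nn

# Sketch of a joint classification + denoising objective on a shared backbone.
# Layers, sizes, and the weight lambda are assumptions for illustration.

class JointModel(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.denoise_head = nn.Conv2d(16, 3, 3, padding=1)  # reconstructed image
        self.cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(16, num_classes))

    def forward(self, x):
        feats = self.backbone(x)
        return self.denoise_head(feats), self.cls_head(feats)

model = JointModel()
noisy, clean = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
labels = torch.randint(0, 10, (4,))

denoised, logits = model(noisy)
lam = 0.5  # classification/denoising trade-off (assumed)
loss = nn.functional.cross_entropy(logits, labels) \
     + lam * nn.functional.mse_loss(denoised, clean)
loss.backward()
print(loss.item())
```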
https://arxiv.org/abs/2409.08943
A fundamental task in mobile robotics is to keep an agent under surveillance using an autonomous robotic platform equipped with a sensing device. Using differential game theory, we study a particular setup of this problem. A Differential Drive Robot (DDR) equipped with a bounded-range sensor wants to keep an Omnidirectional Agent (OA) under surveillance. The goal of the DDR is to maintain the OA inside its detection region for as much time as possible, while the OA, having the opposite goal, wants to leave the region as soon as possible. We formulate the problem as a zero-sum differential game, and we compute the time-optimal motion strategies of the players to achieve their goals. We focus on the case where the OA is faster than the DDR. Given the OA's speed advantage, a winning strategy for the OA is to always move radially away from the DDR's position. However, this work shows that even though that strategy can be optimal in some cases, more complex motion strategies emerge depending on the players' speed ratio. In particular, we exhibit that four classes of singular surfaces may appear in this game: Dispersal, Transition, Universal, and Focal surfaces. Each of these surfaces implies a particular motion strategy for the players.
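A minimal simulation of the baseline radial-escape strategy illustrates the setup: the faster OA moves radially away from the DDR, which turns toward the OA at its maximum rate. Speeds, the detection radius, and the turn rate below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Sketch: OA escapes radially from the DDR; DDR turns toward the OA at a
# bounded rate and drives forward. All numbers are illustrative assumptions.

dt, detection_radius = 0.01, 5.0
v_ddr, v_oa, max_turn = 1.0, 1.5, 2.0            # OA faster than DDR

ddr, theta = np.array([0.0, 0.0]), 0.0            # DDR position and heading
oa = np.array([1.0, 0.5])                         # OA starts inside the region

t = 0.0
while np.linalg.norm(oa - ddr) < detection_radius:
    to_oa = oa - ddr
    bearing = np.arctan2(to_oa[1], to_oa[0]) - theta
    bearing = (bearing + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi]
    theta += np.clip(bearing / dt, -max_turn, max_turn) * dt
    ddr = ddr + v_ddr * np.array([np.cos(theta), np.sin(theta)]) * dt
    oa = oa + v_oa * to_oa / np.linalg.norm(to_oa) * dt  # radial escape
    t += dt

print(f"OA leaves the detection region after {t:.2f} s")
```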
https://arxiv.org/abs/2409.08414
Sound Source Localization (SSL) is an enabling technology for applications such as surveillance and robotics. While traditional Signal Processing (SP)-based SSL methods provide analytic solutions under specific signal and noise assumptions, recent Deep Learning (DL)-based methods have significantly outperformed them. However, their success depends on extensive training data and substantial computational resources. Moreover, they often rely on large-scale annotated spatial data and may struggle when adapting to evolving sound classes. To mitigate these challenges, we propose a novel Class Incremental Learning (CIL) approach, termed SSL-CIL, which avoids serious accuracy degradation due to catastrophic forgetting by incrementally updating the DL-based SSL model through a closed-form analytic solution. In particular, data privacy is ensured since the learning process does not revisit any historical data (exemplar-free), which makes it well suited to smart-home scenarios. Empirical results on the public SSLR dataset demonstrate the superior performance of our proposal, which achieves a localization accuracy of 90.9%, surpassing other competitive methods.
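A minimal sketch of an exemplar-free, closed-form incremental update in the spirit the abstract describes: a linear output layer solved by regularized least squares and updated recursively per batch, so no historical samples are revisited. This is a generic recursive-least-squares scheme, not necessarily SSL-CIL's exact formulation.

```python
import numpy as np

# Sketch: recursive ridge regression (Woodbury identity) as a closed-form,
# exemplar-free incremental update of a linear head. Dimensions are assumed.

d, c, lam = 128, 36, 1e-3            # feature dim, output dim (e.g. DoA bins), ridge
R = np.eye(d) / lam                  # running inverse of (X^T X + lam*I)
W = np.zeros((d, c))                 # analytic output weights

def incremental_update(X: np.ndarray, Y: np.ndarray):
    """Absorb a new batch (X: n x d features, Y: n x c targets) in closed form."""
    global R, W
    K = R @ X.T @ np.linalg.inv(np.eye(len(X)) + X @ R @ X.T)
    R -= K @ X @ R                    # updated inverse autocorrelation
    W += R @ X.T @ (Y - X @ W)        # updated weights, no old data needed

rng = np.random.default_rng(0)
for _ in range(5):                    # successive phases, no replay of old data
    incremental_update(rng.normal(size=(64, d)), rng.normal(size=(64, c)))
print(W.shape)  # (128, 36)
```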
https://arxiv.org/abs/2409.07224
In the current landscape of biometrics and surveillance, the ability to accurately recognize faces in uncontrolled settings is paramount. The Watchlist Challenge addresses this critical need by focusing on face detection and open-set identification in real-world surveillance scenarios. This paper presents a comprehensive evaluation of participating algorithms, using the enhanced UnConstrained College Students (UCCS) dataset with new evaluation protocols. In total, four participants submitted four face detection and nine open-set face recognition systems. The evaluation demonstrates that while detection capabilities are generally robust, closed-set identification performance varies significantly, with models pre-trained on large-scale datasets showing superior performance. However, open-set scenarios require further improvement, especially at higher true positive identification rates, i.e., lower thresholds.
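To make the open-set evaluation concrete, here is a minimal sketch of scoring at a threshold: a mated probe counts as a true positive identification only if its best gallery match is the correct identity and the score clears the threshold, while non-mated probes clearing the threshold are false positives. The exact UCCS protocol details are not reproduced here.

```python
import numpy as np

# Sketch of open-set identification rates at a threshold. Arrays are toy data.

def open_set_rates(mated_scores, mated_correct, nonmated_scores, threshold):
    mated_scores = np.asarray(mated_scores)        # best gallery score, mated probes
    mated_correct = np.asarray(mated_correct)      # is the rank-1 match correct?
    nonmated_scores = np.asarray(nonmated_scores)  # best gallery score, non-mated probes
    tpir = np.mean((mated_scores >= threshold) & mated_correct)
    fpir = np.mean(nonmated_scores >= threshold)
    return tpir, fpir

tpir, fpir = open_set_rates([0.9, 0.8, 0.4], [True, True, False], [0.3, 0.7], 0.5)
print(f"TPIR={tpir:.2f}  FPIR={fpir:.2f}")  # TPIR=0.67  FPIR=0.50
```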
https://arxiv.org/abs/2409.07220
Thermography is especially valuable for the military and other users of surveillance cameras. Some recent methods based on Neural Radiance Fields (NeRF) have been proposed to reconstruct thermal scenes in 3D from a set of thermal and RGB images. However, unlike NeRF, 3D Gaussian splatting (3DGS) prevails due to its rapid training and real-time rendering. In this work, we propose ThermalGaussian, the first thermal 3DGS approach capable of rendering high-quality images in RGB and thermal modalities. We first calibrate the RGB camera and the thermal camera to ensure that both modalities are accurately aligned. Subsequently, we use the registered images to learn the multimodal 3D Gaussians. To prevent the overfitting of any single modality, we introduce several multimodal regularization constraints. We also develop smoothing constraints tailored to the physical characteristics of the thermal modality. Besides, we contribute a real-world dataset named RGBT-Scenes, captured by a hand-held thermal-infrared camera, facilitating future research on thermal scene reconstruction. We conduct comprehensive experiments to show that ThermalGaussian achieves photorealistic rendering of thermal images and improves the rendering quality of RGB images. With the proposed multimodal regularization constraints, we also reduce the model's storage cost by 90%. The code and dataset will be released.
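One plausible instantiation of a smoothing constraint for the thermal modality is a total-variation penalty on the rendered thermal image, reflecting that temperature fields vary more smoothly than RGB textures. This is an illustrative sketch, not necessarily ThermalGaussian's exact constraint.

```python
import torch

# Sketch: total-variation smoothness penalty on a rendered thermal image,
# to be added to the rendering losses with some weight (assumed design).

def tv_smoothness(thermal: torch.Tensor) -> torch.Tensor:
    """Total variation of a rendered thermal image of shape (C, H, W)."""
    dh = (thermal[..., 1:, :] - thermal[..., :-1, :]).abs().mean()
    dw = (thermal[..., :, 1:] - thermal[..., :, :-1]).abs().mean()
    return dh + dw

rendered_thermal = torch.rand(1, 240, 320, requires_grad=True)
loss = tv_smoothness(rendered_thermal)
loss.backward()
print(loss.item())
```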
https://arxiv.org/abs/2409.07200
Oysters are a keystone species in coastal ecosystems, offering significant economic, environmental, and cultural benefits. However, current monitoring systems are often destructive, typically involving dredging to physically collect and count oysters. A nondestructive alternative is manual identification from video footage collected by divers, which is time-consuming, labor-intensive, and dependent on expert input. An alternative to human monitoring is the deployment of a system with trained object detection models that performs real-time, on-edge oyster detection in the field. One such platform is the Aqua2 robot. Effective training of these models requires extensive high-quality data, which is difficult to obtain in marine settings. To address these complications, we introduce a novel method that leverages Stable Diffusion to generate high-quality synthetic data for the marine domain. We exploit diffusion models to create photorealistic marine imagery, using ControlNet inputs to ensure consistency with the segmentation ground-truth mask, the geometry of the scene, and the target domain of real underwater images of oysters. The resulting dataset is used to train a YOLOv10-based vision model, achieving a state-of-the-art 0.657 mAP@50 for oyster detection on the Aqua2 platform. The system we introduce not only improves oyster habitat monitoring, but also paves the way toward autonomous surveillance for various tasks in marine contexts, improving aquaculture and conservation efforts.
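A minimal sketch of mask-conditioned synthetic data generation with a segmentation ControlNet via Hugging Face diffusers. The checkpoints, file names, and prompt are placeholders; the paper's actual conditioning setup (mask, scene geometry, and underwater target domain) is richer than this.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Sketch: generate a photorealistic image consistent with a ground-truth
# segmentation mask, producing a paired (image, mask) training sample.
# Checkpoints and prompt are illustrative assumptions.

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

seg_mask = Image.open("oyster_mask.png")  # hypothetical ground-truth mask file
image = pipe(
    "photorealistic underwater seabed with oysters, diver's camera view",
    image=seg_mask, num_inference_steps=30).images[0]
image.save("synthetic_oyster.png")        # paired with the mask for training
```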
https://arxiv.org/abs/2409.07003
Scene Change Detection (SCD) is vital for applications such as visual surveillance and mobile robotics. However, current SCD methods exhibit a bias to the temporal order of training datasets and limited performance on unseen domains; conventional SCD benchmarks are not able to evaluate generalization or temporal consistency. To tackle these limitations, we introduce a Generalizable Scene Change Detection Framework (GeSCF) in this work. The proposed GeSCF leverages localized semantics of a foundation model, without any re-training or fine-tuning, to generalize over unseen domains. Specifically, we design an adaptive thresholding of the similarity distribution derived from facets of the pre-trained foundation model to generate an initial pseudo-change mask. We further utilize the Segment Anything Model's (SAM) class-agnostic masks to refine the pseudo-masks. Moreover, our proposed framework maintains commutative operations in all settings to ensure complete temporal consistency. Finally, we define new metrics, an evaluation dataset, and an evaluation protocol for Generalizable Scene Change Detection (GeSCD). Extensive experiments demonstrate that GeSCF excels across diverse and challenging environments, establishing a new benchmark for SCD performance.
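A minimal sketch of the pseudo-change-mask step: threshold a per-pixel similarity map between foundation-model features of the two images adaptively from its own distribution, then keep SAM masks that overlap the pseudo-change region. Feature extraction and SAM inference are stubbed out with random data; the mean-minus-k-sigma rule is an assumption.

```python
import numpy as np

# Sketch: adaptive thresholding of a feature-similarity map plus mask-based
# refinement. Note cosine similarity is symmetric in t0/t1, consistent with
# the commutativity the abstract emphasizes.

def pseudo_change_mask(feat_t0, feat_t1, k=1.5):
    # feat_t*: (H, W, D) dense per-pixel features
    sim = (feat_t0 * feat_t1).sum(-1) / (
        np.linalg.norm(feat_t0, axis=-1) * np.linalg.norm(feat_t1, axis=-1) + 1e-8)
    tau = sim.mean() - k * sim.std()   # adaptive threshold from the distribution
    return sim < tau                   # low similarity => candidate change

def refine_with_sam(pseudo_mask, sam_masks, overlap=0.5):
    # keep class-agnostic masks whose area mostly falls inside the pseudo mask
    keep = [m for m in sam_masks
            if m.sum() and (pseudo_mask & m).sum() / m.sum() > overlap]
    return np.any(keep, axis=0) if keep else np.zeros_like(pseudo_mask)

rng = np.random.default_rng(0)
f0, f1 = rng.normal(size=(64, 64, 16)), rng.normal(size=(64, 64, 16))
sam_masks = [rng.random((64, 64)) > 0.9 for _ in range(5)]  # stand-in SAM output
refined = refine_with_sam(pseudo_change_mask(f0, f1), sam_masks)
print(refined.shape, int(refined.sum()))
```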
https://arxiv.org/abs/2409.06214
With the rapid development of drone technology, accurate detection of Unmanned Aerial Vehicles (UAVs) has become essential for applications such as surveillance, security, and airspace management. In this paper, we propose a novel trajectory-guided method, the Patch Intensity Convergence (PIC) technique, which generates high-fidelity bounding boxes for UAV detection tasks without the effort required for manual labeling. The PIC technique forms the foundation for developing UAVDB, a database explicitly created for UAV detection. Unlike existing datasets, which often use low-resolution footage or focus on UAVs against simple backgrounds, UAVDB employs high-resolution video to capture UAVs at various scales, ranging from hundreds of pixels to nearly single-digit sizes. This broad scale variation enables comprehensive evaluation of detection algorithms across different UAV sizes and distances. Applying the PIC technique, we can also efficiently generate detection datasets from trajectory or positional data, even without size information. We extensively benchmark UAVDB using YOLOv8-series detectors, offering a detailed performance analysis. Our findings highlight UAVDB's potential as a vital database for advancing UAV detection, particularly in high-resolution and long-distance tracking scenarios.
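For reference, benchmarking a YOLOv8-series detector of the kind described above takes only a few lines with the ultralytics API. The dataset YAML name is a placeholder for a UAVDB export in YOLO format, and the training settings are illustrative.

```python
from ultralytics import YOLO

# Sketch: train and validate a YOLOv8 detector. "uavdb.yaml" is a hypothetical
# dataset config; epochs and image size are illustrative choices.

model = YOLO("yolov8n.pt")                              # smallest YOLOv8 variant
model.train(data="uavdb.yaml", epochs=100, imgsz=1280)  # high res for tiny UAVs
metrics = model.val()                                   # mAP@50, mAP@50-95, ...
print(metrics.box.map50)
```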
https://arxiv.org/abs/2409.06490
Video synopsis is an efficient method for condensing surveillance videos. The technique begins with the detection and tracking of objects, followed by the creation of object tubes. These tubes consist of sequences, each containing the chronologically ordered bounding boxes of a unique object. To generate a condensed video, the first step involves rearranging the object tubes to maximize the number of non-overlapping objects in each frame. Then, these tubes are stitched onto a background image extracted from the source video. The lack of a standard dataset for the video synopsis task hinders the comparison of different video synopsis models. This paper addresses this issue by introducing a standard dataset, called SynoClip, designed specifically for the video synopsis task. SynoClip includes all the features needed to evaluate various models directly and effectively. Additionally, this work introduces a video synopsis model, called FGS, with low computational cost. The model includes an empty-frame object detector to identify frames devoid of objects, facilitating efficient utilization of the deep object detector. Moreover, a tube grouping algorithm is proposed to maintain relationships among tubes in the synthesized video. This is followed by a greedy tube rearrangement algorithm, which efficiently determines the start time of each tube. Finally, the proposed model is evaluated on the proposed dataset. The source code, fine-tuned object detection model, and tutorials are available at this https URL.
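A minimal sketch of greedy tube rearrangement: each tube (a chronologically ordered list of per-frame bounding boxes) is assigned the start time that minimizes its box overlap with already-placed tubes, ties broken by the earliest start. The ordering heuristic and collision cost are illustrative; FGS's actual algorithm also handles tube grouping.

```python
# Sketch of greedy start-time assignment for object tubes. Boxes are
# (x1, y1, x2, y2); tubes are lists of boxes, one per frame.

def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def collision_cost(tube, start, placed):
    cost = 0.0
    for other, other_start in placed:
        for t, box in enumerate(tube):
            k = start + t - other_start          # other tube's frame index
            if 0 <= k < len(other):
                cost += iou(box, other[k])
    return cost

def greedy_schedule(tubes, horizon):
    placed = []
    for tube in sorted(tubes, key=len, reverse=True):   # longest tubes first
        start = min(range(horizon),
                    key=lambda s: (collision_cost(tube, s, placed), s))
        placed.append((tube, start))
    return placed

tubes = [[(0, 0, 10, 10)] * 30, [(5, 5, 15, 15)] * 20]  # two toy object tubes
for tube, start in greedy_schedule(tubes, horizon=60):
    print(f"tube of length {len(tube)} starts at frame {start}")
```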
https://arxiv.org/abs/2409.05230
Camera trap imagery has become an invaluable asset in contemporary wildlife surveillance, enabling researchers to observe and investigate the behaviors of wild animals. While existing methods rely solely on image data for classification, this may not suffice in cases of suboptimal animal angles, lighting, or image quality. This study introduces a novel approach that enhances wild animal classification by combining specific metadata (temperature, location, time, etc.) with image data. Using a dataset focused on the Norwegian climate, our models show an accuracy increase from 98.4% to 98.9% compared to existing methods. Notably, our approach also achieves high accuracy with metadata-only classification, highlighting its potential to reduce reliance on image quality. This work paves the way for integrated systems that advance wildlife classification technology.
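A minimal sketch of the fusion idea: a CNN image embedding concatenated with a small vector of metadata features before the classifier. The backbone, metadata encoding, and dimensions are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Sketch: late fusion of an image embedding with metadata (e.g. temperature,
# time of day, location encoding). All sizes are placeholders.

class MetadataFusionClassifier(nn.Module):
    def __init__(self, img_dim=256, meta_dim=8, num_species=12):
        super().__init__()
        self.img_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, img_dim))
        self.meta_encoder = nn.Sequential(nn.Linear(meta_dim, 32), nn.ReLU())
        self.classifier = nn.Linear(img_dim + 32, num_species)

    def forward(self, image, metadata):
        fused = torch.cat([self.img_encoder(image),
                           self.meta_encoder(metadata)], dim=1)
        return self.classifier(fused)

model = MetadataFusionClassifier()
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 8))
print(logits.shape)  # torch.Size([4, 12])
```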
https://arxiv.org/abs/2409.04825
Deep learning-based person re-identification (re-id) models are widely employed in surveillance systems and inevitably inherit the vulnerability of deep networks to adversarial attacks. Existing attacks merely consider cross-dataset and cross-model transferability, ignoring the cross-test capability to perturb models trained in different domains. To rigorously examine the robustness of real-world re-id models, the Meta Transferable Generative Attack (MTGA) method is proposed, which adopts meta-learning optimization to promote the generative attacker producing highly transferable adversarial examples by learning comprehensively simulated transfer-based cross-model, cross-dataset, and cross-test black-box meta attack tasks. Specifically, cross-model and cross-dataset black-box attack tasks are first mimicked by selecting different re-id models and datasets for the meta-train and meta-test attack processes. As different models may focus on different feature regions, the Perturbation Random Erasing module is further devised to prevent the attacker from learning to corrupt only model-specific features. To boost the attacker's cross-test transferability, the Normalization Mix strategy is introduced to imitate diverse feature embedding spaces by mixing multi-domain statistics of target models. Extensive experiments show the superiority of MTGA: in cross-model/dataset and cross-model/dataset/test attacks, MTGA outperforms the SOTA methods by 21.5% and 11.3% in mean mAP drop rate, respectively. The code of MTGA will be released after the paper is accepted.
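A minimal sketch of the Perturbation Random Erasing idea: randomly zeroing rectangular regions of the generated perturbation during meta-training so the attacker cannot rely on corrupting model-specific feature regions. The region sizes and erase probability are assumptions, not the paper's settings.

```python
import torch

# Sketch: random erasing applied to an adversarial perturbation tensor.
# Fractions and probability are illustrative assumptions.

def perturbation_random_erasing(delta: torch.Tensor, p=0.5, max_frac=0.4):
    """delta: (B, C, H, W) adversarial perturbation."""
    B, _, H, W = delta.shape
    out = delta.clone()
    for i in range(B):
        if torch.rand(1).item() < p:
            eh = int(H * (0.1 + (max_frac - 0.1) * torch.rand(1).item()))
            ew = int(W * (0.1 + (max_frac - 0.1) * torch.rand(1).item()))
            y = torch.randint(0, H - eh + 1, (1,)).item()
            x = torch.randint(0, W - ew + 1, (1,)).item()
            out[i, :, y:y + eh, x:x + ew] = 0.0  # erase this region of the noise
    return out

delta = torch.randn(4, 3, 256, 128)  # typical re-id input aspect ratio
print(perturbation_random_erasing(delta).shape)
```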
https://arxiv.org/abs/2409.04208
In the current landscape of autonomous vehicle (AV) safety and security research, there are multiple isolated problems being tackled by the community at large. Due to the lack of common evaluation criteria, several important research questions are at odds with one another. For instance, while much research has been conducted on physical attacks deceiving AV perception systems, there is often inadequate investigation of working defenses and of the downstream effects on safe vehicle control. This paper provides a thorough description of the current state of AV safety and security research. We provide individual sections for the primary research questions that concern this research area, including AV surveillance, sensor system reliability, security of the AV stack, algorithmic robustness, and safe environment interaction. We wrap up the paper with a discussion of the issues that concern the interactions of these separate problems. At the conclusion of each section, we propose future research questions that still lack conclusive answers. This position article will serve as an entry point for novice and veteran researchers seeking to partake in this research domain.
https://arxiv.org/abs/2409.03899
In recent years, the development of deep learning approaches for the task of person re-identification has led to impressive results. However, this comes with limitations for industrial and practical real-world applications. Firstly, most existing works operate on closed-world scenarios, in which the people to re-identify (probes) are compared to a closed set (gallery). Real-world scenarios are often open-set problems in which the gallery is not known a priori, but the number of open-set approaches in the literature is significantly lower. Secondly, challenges such as multi-camera setups, occlusions, real-time requirements, etc., further constrain the applicability of off-the-shelf methods. This work presents MICRO-TRACK, a Modular Industrial multi-Camera Re-identification and Open-set Tracking system that is real-time, scalable, and easy to integrate into existing industrial surveillance scenarios. Furthermore, we release a novel Re-ID and tracking dataset acquired in an industrial manufacturing facility, dubbed Facility-ReID, consisting of 18-minute videos captured by 8 surveillance cameras.
https://arxiv.org/abs/2409.03879
Background: The 2024 Mpox outbreak, particularly severe in Africa with clade 1b emergence, has highlighted critical gaps in diagnostic capabilities in resource-limited settings. This study aimed to develop and validate an artificial intelligence (AI)-driven, on-device screening tool for Mpox, designed to function offline in low-resource environments. Methods: We developed a YOLOv8n-based deep learning model trained on 2,700 images (900 each of Mpox, other skin conditions, and normal skin), including synthetic data. The model was validated on 360 images and tested on 540 images. A larger external validation was conducted using 1,500 independent images. Performance metrics included accuracy, precision, recall, F1-score, sensitivity, and specificity. Findings: The model demonstrated high accuracy (96%) in the final test set. For Mpox detection, it achieved 93% precision, 97% recall, and an F1-score of 95%. Sensitivity and specificity for Mpox detection were 97% and 96%, respectively. Performance remained consistent in the larger external validation, confirming the model's robustness and generalizability. Interpretation: This AI-driven screening tool offers a rapid, accurate, and scalable solution for Mpox detection in resource-constrained settings. Its offline functionality and high performance across diverse datasets suggest significant potential for improving Mpox surveillance and management, particularly in areas lacking traditional diagnostic infrastructure.
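To make the reported metrics concrete, here is a minimal sketch computing them in a one-vs-rest view of the three-class problem (Mpox, other skin conditions, normal skin), treating Mpox as the positive class. The label arrays are toy stand-ins, not the study's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Sketch: precision, recall/sensitivity, specificity, and F1 for the Mpox
# class, one-vs-rest. Labels below are toy examples.

y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1])   # 0 = Mpox, 1 = other, 2 = normal
y_pred = np.array([0, 0, 1, 0, 2, 2, 0, 1])

pos_true = (y_true == 0).astype(int)           # one-vs-rest for the Mpox class
pos_pred = (y_pred == 0).astype(int)
tn, fp, fn, tp = confusion_matrix(pos_true, pos_pred).ravel()

print("precision:", precision_score(pos_true, pos_pred))
print("recall/sensitivity:", recall_score(pos_true, pos_pred))  # tp / (tp + fn)
print("specificity:", tn / (tn + fp))                           # tn / (tn + fp)
print("F1:", f1_score(pos_true, pos_pred))
```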
https://arxiv.org/abs/2409.03806