Classifying hyperspectral images is a difficult task in remote sensing due to their complex, high-dimensional data. To address this challenge, we propose HSIMamba, a novel framework that uses bidirectional reversed convolutional neural network pathways to extract spectral features more efficiently. Additionally, it incorporates a specialized block for spatial analysis. Our approach combines the operational efficiency of CNNs with the dynamic feature extraction capability of the attention mechanisms found in Transformers, while avoiding their high computational demands. HSIMamba is designed to process data bidirectionally, significantly enhancing the extraction of spectral features and integrating them with spatial information for comprehensive analysis. This approach improves classification accuracy beyond current benchmarks and addresses the computational inefficiencies encountered with advanced models such as Transformers. HSIMamba was tested on three widely recognized datasets (Houston 2013, Indian Pines, and Pavia University) and demonstrated exceptional performance, surpassing existing state-of-the-art models in HSI classification. This work highlights the methodological innovation of HSIMamba and its practical implications, which are particularly valuable in contexts where computational resources are limited. HSIMamba redefines the standards of efficiency and accuracy in HSI classification, thereby enhancing the capabilities of remote sensing applications. Hyperspectral imaging has become a crucial tool for environmental surveillance, agriculture, and other critical areas that require detailed analysis of the Earth's surface. Please see our code in HSIMamba for more details.
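The bidirectional spectral processing described above can be illustrated with a toy sketch: a simple linear recurrence stands in for HSIMamba's actual selective-state-space pathways (which the abstract does not specify), scanning the spectral band sequence forward and reversed and pairing the results per band. The function names and the decay parameter are illustrative assumptions.

```python
def linear_scan(xs, decay=0.5):
    """Toy linear recurrence h_t = decay*h_{t-1} + x_t (stand-in for a state-space scan)."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out

def bidirectional_features(spectrum, decay=0.5):
    """Scan the spectral sequence forward and reversed, then pair the features per band."""
    fwd = linear_scan(spectrum, decay)
    bwd = linear_scan(spectrum[::-1], decay)[::-1]
    return list(zip(fwd, bwd))
```

Each band thus sees context from both earlier and later wavelengths, which is the point of the bidirectional design.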
https://arxiv.org/abs/2404.00272
Head pose estimation has become a crucial area of research in computer vision, given its usefulness in a wide range of applications including robotics, surveillance, and driver attention monitoring. One of the most difficult challenges in this field is managing the head occlusions that frequently occur in real-world scenarios. In this paper, we propose a novel and efficient framework that is robust in real-world head occlusion scenarios. In particular, we propose unsupervised latent embedding clustering with regression and classification components for each pose angle. The model optimizes latent feature representations for occluded and non-occluded images through a clustering term while improving fine-grained angle predictions. Experimental evaluation on in-the-wild head pose benchmark datasets reveals competitive performance in comparison to state-of-the-art methodologies, with the advantage of a significant data reduction. We observe a substantial improvement in occluded head pose estimation. Additionally, an ablation study is conducted to ascertain the impact of the clustering term within our proposed framework.
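The combined objective described above, angle regression plus an unsupervised clustering term on latent embeddings, can be sketched as follows. The scalar 1-D latents, fixed centroid set, and weighting are simplifying assumptions for illustration, not the paper's actual formulation.

```python
def clustering_regression_loss(latents, preds, targets, centroids, lam=0.1):
    """Combined objective: angle-regression MSE plus a clustering term that
    pulls each latent embedding toward its nearest centroid (illustrative)."""
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
    cluster = sum(min((z - c) ** 2 for c in centroids) for z in latents) / len(latents)
    return mse + lam * cluster
```

The clustering term is zero when every latent sits on a centroid, so it only shapes the embedding space without overriding the angle supervision.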
https://arxiv.org/abs/2403.20251
Multi-target multi-camera tracking is a crucial task that involves identifying and tracking individuals over time using video streams from multiple cameras. This task has practical applications in various fields, such as visual surveillance, crowd behavior analysis, and anomaly detection. However, due to the difficulty and cost of collecting and labeling data, existing datasets for this task are either synthetically generated or artificially constructed within a controlled camera network setting, which limits their ability to model real-world dynamics and generalize to diverse camera configurations. To address this issue, we present MTMMC, a real-world, large-scale dataset that includes long video sequences captured by 16 multi-modal cameras in two different environments - campus and factory - across various time, weather, and season conditions. This dataset provides a challenging test-bed for studying multi-camera tracking under diverse real-world complexities and includes an additional input modality of spatially aligned and temporally synchronized RGB and thermal cameras, which enhances the accuracy of multi-camera tracking. MTMMC is a super-set of existing datasets, benefiting independent fields such as person detection, re-identification, and multiple object tracking. We provide baselines and new learning setups on this dataset and set the reference scores for future studies. The datasets, models, and test server will be made publicly available.
https://arxiv.org/abs/2403.20225
Because the metro video surveillance system has not been able to effectively solve the metro crowd density estimation problem, a Metro Crowd density estimation Network (called MCNet) is proposed to automatically classify the crowd density level of passengers. First, an Integrating Multi-scale Attention (IMA) module is proposed to enhance the ability of plain classifiers to extract semantic crowd texture features, accommodating the characteristics of crowd texture features. The innovation of the IMA module is to fuse dilated convolution, multi-scale feature extraction, and an attention mechanism to obtain multi-scale crowd feature activations from a larger receptive field at lower computational cost, and to strengthen the crowd activation state of convolutional features in the top layers. Second, a novel lightweight crowd texture feature extraction network is proposed that can directly process video frames and automatically extract texture features for crowd density estimation; its faster image processing speed and fewer network parameters make it flexible to deploy on embedded platforms with limited hardware resources. Finally, this paper integrates the IMA module and the lightweight crowd texture feature extraction network to construct MCNet, and validates the feasibility of this network on an image classification dataset (CIFAR-10) and four crowd density datasets (PETS2009, Mall, QUT, and SH_METRO) to assess whether MCNet can be a suitable solution for crowd density estimation in metro video surveillance, where there are image processing challenges such as high density, heavy occlusion, perspective distortion, and limited hardware resources.
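A rough sketch of the IMA idea: several dilated-convolution branches at different rates are fused with a softmax attention over branch statistics. The 1-D single-channel setting, the kernel, and attention computed from branch means are illustrative simplifications, not the module's actual design.

```python
import math

def dilated_conv1d(x, kernel, dilation):
    """Valid-mode 1-D convolution with a dilation factor (toy, single channel)."""
    span = (len(kernel) - 1) * dilation
    return [sum(k * x[i + j * dilation] for j, k in enumerate(kernel))
            for i in range(len(x) - span)]

def ima_fuse(x, kernel=(0.25, 0.5, 0.25), dilations=(1, 2, 4)):
    """Run multi-scale dilated branches, then fuse them with a softmax attention
    weight derived from each branch's mean activation (illustrative)."""
    branches = []
    for d in dilations:
        span = (len(kernel) - 1) * d
        padded = [0.0] * (span // 2) + list(x) + [0.0] * (span - span // 2)
        branches.append(dilated_conv1d(padded, kernel, d))
    means = [sum(b) / len(b) for b in branches]
    exps = [math.exp(m) for m in means]
    weights = [e / sum(exps) for e in exps]
    return [sum(w * b[i] for w, b in zip(weights, branches)) for i in range(len(x))]
```

Larger dilations widen the receptive field without extra parameters, which is how the module stays cheap.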
https://arxiv.org/abs/2403.20173
The Space Domain Awareness (SDA) community routinely tracks satellites in orbit by fitting an orbital state to observations made by the Space Surveillance Network (SSN). In order to fit such orbits, an accurate model of the forces that are acting on the satellite is required. Over the past several decades, high-quality, physics-based models have been developed for satellite state estimation and propagation. These models are exceedingly good at estimating and propagating orbital states for non-maneuvering satellites; however, there are several classes of anomalous accelerations that a satellite might experience which are not well-modeled, such as satellites that use low-thrust electric propulsion to modify their orbit. Physics-Informed Neural Networks (PINNs) are a valuable tool for these classes of satellites as they combine physics models with Deep Neural Networks (DNNs), which are highly expressive and versatile function approximators. By combining a physics model with a DNN, the machine learning model need not learn astrodynamics, which results in more efficient and effective utilization of machine learning resources. This paper details the application of PINNs to estimate the orbital state and a continuous, low-amplitude anomalous acceleration profile for satellites. The PINN is trained to learn the unknown acceleration by minimizing the mean square error of observations. We evaluate the performance of pure physics models with PINNs in terms of their observation residuals and their propagation accuracy beyond the fit span of the observations. For a two-day simulation of a GEO satellite using an unmodeled acceleration profile on the order of $10^{-8} \text{ km/s}^2$, the PINN outperformed the best-fit physics model by orders of magnitude for both observation residuals (123 arcsec vs 1.00 arcsec) as well as propagation accuracy (3860 km vs 164 km after five days).
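The training signal described above, minimizing observation MSE with respect to an unknown acceleration on top of a fixed physics model, can be illustrated with a 1-D toy: known constant gravity plus an unknown constant anomalous acceleration recovered by gradient descent on the observation residuals. The dynamics, parameters, and plain scalar in place of a DNN are all illustrative, not the paper's orbital model.

```python
def predict(a_hat, ts, g=-1.0):
    """Physics model (constant g) plus a learned anomalous acceleration a_hat:
    position x(t) = 0.5 * (g + a_hat) * t^2 from rest."""
    return [0.5 * (g + a_hat) * t * t for t in ts]

def fit_anomalous_accel(ts, obs, steps=200, lr=0.02):
    """Recover the unknown acceleration by minimizing observation MSE
    with numeric gradient descent (toy stand-in for PINN training)."""
    a_hat, eps = 0.0, 1e-6
    def mse(a):
        return sum((p - o) ** 2 for p, o in zip(predict(a, ts), obs)) / len(obs)
    for _ in range(steps):
        grad = (mse(a_hat + eps) - mse(a_hat - eps)) / (2 * eps)
        a_hat -= lr * grad
    return a_hat
```

As in the paper, the physics part is fixed and only the residual acceleration is learned, so the optimizer never has to rediscover the known dynamics.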
https://arxiv.org/abs/2403.19736
Robotic access monitoring of multiple target areas has applications including checkpoint enforcement, surveillance and containment of fire and flood hazards. Monitoring access for a single target region has been successfully modeled as a minimum-cut problem. We generalize this model to support multiple target areas using two approaches: iterating on individual targets and examining the collections of targets holistically. Through simulation we measure the performance of each approach on different scenarios.
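The single-target min-cut formulation mentioned above can be illustrated with a standard max-flow computation: by max-flow/min-cut duality, the flow value equals the minimum total capacity of edges whose removal disconnects source from target. The small graph in the test is a toy example, not the paper's model.

```python
from collections import deque

def max_flow(capacity, s, t):
    """Edmonds-Karp max-flow on an adjacency-matrix graph; the returned value
    equals the min-cut capacity separating s from t."""
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    total = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and capacity[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:
            return total  # no augmenting path: current flow is maximal
        # find the bottleneck along the path, then augment
        v, bottleneck = t, float("inf")
        while v != s:
            u = parent[v]
            bottleneck = min(bottleneck, capacity[u][v] - flow[u][v])
            v = u
        v = t
        while v != s:
            u = parent[v]
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
            v = u
        total += bottleneck
```

The paper's two multi-target generalizations would, roughly, either repeat such a computation per target or solve one cut over the combined target set.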
https://arxiv.org/abs/2403.19375
Video Anomaly Detection (VAD), aiming to identify abnormalities within a specific context and timeframe, is crucial for intelligent Video Surveillance Systems. While recent deep learning-based VAD models have shown promising results by generating high-resolution frames, they often lack competence in preserving detailed spatial and temporal coherence in video frames. To tackle this issue, we propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task. Specifically, we introduce a two-branch vision transformer network designed to capture deep visual features of video frames, addressing spatial and temporal dimensions responsible for modeling appearance and motion patterns, respectively. The inter-patch relationship in each dimension is decoupled into inter-patch similarity and the order information of each patch. To mitigate memory consumption, we convert the order information prediction task into a multi-label learning problem, and the inter-patch similarity prediction task into a distance matrix regression problem. Comprehensive experiments demonstrate the effectiveness of our method, surpassing pixel-generation-based methods by a significant margin across three public benchmarks. Additionally, our approach outperforms other self-supervised learning-based methods.
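The decoupling described above can be sketched by constructing the two training targets from patch embeddings: a pairwise distance matrix to regress, and per-patch order labels cast as multi-label targets (reduced here to one-hot for brevity). This target construction is a simplified reading of the paper, not its actual code.

```python
def interpatch_targets(patches):
    """Build (a) a pairwise Euclidean distance matrix for regression and
    (b) per-patch order labels as multi-label (here one-hot) targets."""
    n = len(patches)
    dist = [[sum((a - b) ** 2 for a, b in zip(patches[i], patches[j])) ** 0.5
             for j in range(n)] for i in range(n)]
    order = [[1 if j == i else 0 for j in range(n)] for i in range(n)]
    return dist, order
```

Regressing the full distance matrix keeps the pairwise-similarity signal while avoiding the memory cost of predicting every patch pair relation separately.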
https://arxiv.org/abs/2403.19111
Medical imaging is critical to the diagnosis, surveillance, and treatment of many health conditions, including oncological, neurological, cardiovascular, and musculoskeletal disorders, among others. Radiologists interpret these complex, unstructured images and articulate their assessments through narrative reports that remain largely unstructured. This unstructured narrative must be converted into a structured semantic representation to facilitate secondary applications such as retrospective analyses or clinical decision support. Here, we introduce the Corpus of Annotated Medical Imaging Reports (CAMIR), which includes 609 annotated radiology reports from three imaging modality types: Computed Tomography, Magnetic Resonance Imaging, and Positron Emission Tomography-Computed Tomography. Reports were annotated using an event-based schema that captures clinical indications, lesions, and medical problems. Each event consists of a trigger and multiple arguments, and a majority of the argument types, including anatomy, normalize the spans to pre-defined concepts to facilitate secondary use. CAMIR uniquely combines a granular event structure and concept normalization. To extract CAMIR events, we explored two BERT (Bi-directional Encoder Representation from Transformers)-based architectures, including an existing architecture (mSpERT) that jointly extracts all event information and a multi-step approach (PL-Marker++) that we augmented for the CAMIR schema.
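A minimal sketch of an event-based schema of the kind described: a trigger with typed arguments, where argument spans are normalized to pre-defined concepts. The concept dictionary, field names, and example spans below are hypothetical, not CAMIR's actual inventory.

```python
from dataclasses import dataclass, field

# Hypothetical anatomy concept dictionary; CAMIR's real inventory is far larger.
ANATOMY_CONCEPTS = {"rt upper lobe": "right upper lobe", "r.u.l.": "right upper lobe"}

@dataclass
class Argument:
    type: str          # e.g. "anatomy", "size"
    span: str          # text span from the report
    concept: str = ""  # normalized concept, filled by normalize()

@dataclass
class Event:
    trigger: str                       # e.g. "lesion"
    arguments: list = field(default_factory=list)

def normalize(arg):
    """Map an argument span to a pre-defined concept when one is known."""
    if arg.type == "anatomy":
        arg.concept = ANATOMY_CONCEPTS.get(arg.span.lower(), arg.span)
    return arg
```

Normalizing spans to concepts is what makes such annotations directly usable in downstream queries, rather than requiring string matching over raw report text.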
https://arxiv.org/abs/2403.18975
In recent years, remarkable advancements have been achieved in the field of image generation, primarily driven by the escalating demand for high-quality outcomes across various image generation subtasks, such as inpainting, denoising, and super resolution. A major effort is devoted to exploring the application of super-resolution techniques to enhance the quality of low-resolution images. In this context, our method explores in depth the problem of ship image super resolution, which is crucial for coastal and port surveillance. We investigate the opportunity offered by the growing interest in text-to-image diffusion models, taking advantage of the prior knowledge that such foundation models have already learned. In particular, we present a diffusion-model-based architecture that leverages text conditioning during training while being class-aware, to best preserve the crucial details of the ships during the generation of the super-resolved image. Given the specificity of this task and the scarce availability of off-the-shelf data, we also introduce a large labeled ship dataset scraped from online ship images, mostly from the ShipSpotting\footnote{\url{this http URL}} website. Our method achieves more robust results than other deep learning models previously employed for super resolution, as proven by the multiple experiments performed. Moreover, we investigate how this model can benefit downstream tasks, such as classification and object detection, thus emphasizing practical implementation in a real-world scenario. Experimental results show the flexibility, reliability, and impressive performance of the proposed framework over state-of-the-art methods for different tasks. The code is available at: this https URL .
https://arxiv.org/abs/2403.18370
Applications of an efficient emotion recognition system can be found in several domains such as medicine, driver fatigue surveillance, social robotics, and human-computer interaction. Appraising human emotional states, behaviors, and reactions displayed in real-world settings can be accomplished using latent continuous dimensions. Continuous dimensional models of human affect, such as those based on valence and arousal, are more accurate in describing a broad range of spontaneous everyday emotions than more traditional models of discrete stereotypical emotion categories (e.g. happiness, surprise). Most of the prior work on estimating valence and arousal considers laboratory settings and acted data. But for emotion recognition systems to be deployed and integrated into real-world mobile and computing devices, we need to consider data collected in the wild. Action recognition is a domain of Computer Vision that involves capturing complementary information on appearance from still frames and motion between frames. In this paper, we treat emotion recognition from the perspective of action recognition by exploring the application of deep learning architectures specifically designed for action recognition to continuous affect recognition. We propose a novel three-stream end-to-end deep learning regression pipeline with an attention mechanism, an ensemble design based on sub-modules of multiple state-of-the-art action recognition systems. The pipeline incorporates a novel data pre-processing approach with a spatial self-attention mechanism to extract keyframes. The optical flow of high-attention regions of the face is extracted to capture temporal context. The AFEW-VA in-the-wild dataset has been used to conduct comparative experiments. Quantitative analysis shows that the proposed model outperforms multiple standard baselines of both emotion recognition and action recognition models.
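The keyframe-extraction step can be sketched with a crude stand-in: score each frame by its mean absolute difference from the previous frame (in place of the paper's spatial self-attention) and keep the top-k indices. Frame representation and scoring are illustrative assumptions.

```python
def keyframes(frames, k=2):
    """Pick the k frames with the largest mean absolute change from their
    predecessor; each frame is a flat list of pixel intensities (toy)."""
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        scores.append(sum(abs(a - b) for a, b in zip(cur, prev)) / len(cur))
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Any change-based score favors frames where motion happens, which is the information the downstream optical-flow stream needs.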
https://arxiv.org/abs/2403.16263
This study offers an in-depth analysis of the application and implications of the National Institute of Standards and Technology's AI Risk Management Framework (NIST AI RMF) within the domain of surveillance technologies, particularly facial recognition technology. Given the inherently high-risk and consequential nature of facial recognition systems, our research emphasizes the critical need for a structured approach to risk management in this sector. The paper presents a detailed case study demonstrating the utility of the NIST AI RMF in identifying and mitigating risks that might otherwise remain unnoticed in these technologies. Our primary objective is to develop a comprehensive risk management strategy that advances the practice of responsible AI utilization in feasible, scalable ways. We propose a six-step process tailored to the specific challenges of surveillance technology that aims to produce a more systematic and effective risk management practice. This process emphasizes continual assessment and improvement to facilitate companies in managing AI-related risks more robustly and ensuring ethical and responsible deployment of AI systems. Additionally, our analysis uncovers and discusses critical gaps in the current framework of the NIST AI RMF, particularly concerning its application to surveillance technologies. These insights contribute to the evolving discourse on AI governance and risk management, highlighting areas for future refinement and development in frameworks like the NIST AI RMF.
https://arxiv.org/abs/2403.15646
Algorithmic decision-making is increasingly being adopted across public higher education. The expansion of data-driven practices by post-secondary institutions has occurred in parallel with the adoption of New Public Management approaches by neoliberal administrations. In this study, we conduct a qualitative analysis of an in-depth ethnographic case study of data and algorithms in use at a public college in Ontario, Canada. We identify the data, algorithms, and outcomes in use at the college. We assess how the college's processes and relationships support those outcomes and the different stakeholders' perceptions of the college's data-driven systems. In addition, we find that the growing reliance on algorithmic decisions leads to increased student surveillance, exacerbation of existing inequities, and the automation of the faculty-student relationship. Finally, we identify a cycle of increased institutional power perpetuated by algorithmic decision-making, and driven by a push towards financial sustainability.
https://arxiv.org/abs/2403.13969
Ship detection from satellite imagery using Deep Learning (DL) is an indispensable solution for maritime surveillance. However, applying DL models trained on one dataset to others that differ in spatial resolution and radiometric features requires many adjustments. To overcome this issue, this paper focused on DL models trained on datasets that consist of different optical images and a combination of radar and optical data. When dealing with a limited number of training images, the performance of DL models via this approach was satisfactory. They improved average precision by 5-20%, depending on the optical images tested. Likewise, DL models trained on the combined optical and radar dataset could be applied to both optical and radar images. Our experiments showed that the models trained on an optical dataset could be used for radar images, while those trained on a radar dataset offered very poor scores when applied to optical images.
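The average-precision metric that the reported 5-20% gains refer to can be computed from a ranked detection list as follows. This is the standard definition, not code from the paper.

```python
def average_precision(ranked_labels):
    """AP over a ranked list of detections (1 = true positive, 0 = false positive):
    the mean of precision@k taken at each true-positive rank."""
    hits, precisions = 0, []
    for k, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0
```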
https://arxiv.org/abs/2403.13698
With the robust development of technology, license plate recognition technology can now be properly applied in various scenarios, such as road monitoring, tracking of stolen vehicles, detection at parking lot entrances and exits, and so on. However, the precondition for these applications to function normally is that the license plate must be 'clear' enough to be recognized by the system with the correct license plate number. If the license plate becomes blurred due to some external factors, then the accuracy of recognition will be greatly reduced. Although there are many road surveillance cameras in Taiwan, the quality of most cameras is not good, often leading to the inability to recognize license plate numbers due to low photo resolution. Therefore, this study focuses on using super-resolution technology to process blurred license plates. This study will mainly fine-tune three super-resolution models: Real-ESRGAN, A-ESRGAN, and StarSRGAN, and compare their effectiveness in enhancing the resolution of license plate photos and enabling accurate license plate recognition. By comparing different super-resolution models, it is hoped to find the most suitable model for this task, providing valuable references for future researchers.
https://arxiv.org/abs/2403.15466
The advancements in computer vision and image processing techniques have led to the emergence of new applications in the domains of visual surveillance, targeted advertisement, content-based searching, and human-computer interaction. Among the various techniques in computer vision, face analysis in particular has gained much attention. Several previous studies have tried to explore different applications of facial feature processing for a variety of tasks, including age and gender classification. However, despite several previous studies having explored the problem, the age and gender classification of in-the-wild human faces is still far from achieving the desired levels of accuracy required for real-world applications. This paper, therefore, attempts to bridge this gap by proposing a hybrid model that combines self-attention and BiLSTM approaches for age and gender classification problems. The proposed model's performance is compared with several state-of-the-art models proposed so far. Improvements of approximately 10% and 6% over the state-of-the-art implementations for age and gender classification, respectively, are noted for the proposed model. The proposed model thus achieves superior performance and provides more generalized learning. The model can, therefore, be applied as a core classification component in various image processing and computer vision problems.
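The self-attention half of the proposed hybrid can be sketched as plain scaled dot-product attention; using Q = K = V = X without learned projections is a simplification for illustration.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights):
    scores = X X^T / sqrt(d), output = softmax(scores) X."""
    d = len(X[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X] for q in X]
    weights = [softmax(row) for row in scores]
    return [[sum(w * v[j] for w, v in zip(row, X)) for j in range(d)] for row in weights]
```

In the full model these attended features would feed a BiLSTM before the age/gender heads.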
https://arxiv.org/abs/2403.12483
3D sensors, also known as RGB-D sensors, utilize depth images where each pixel measures the distance from the camera to objects, using principles like structured light or time-of-flight. Advances in artificial vision have led to affordable 3D cameras capable of real-time object detection without object movement, surpassing 2D cameras in information depth. These cameras can identify objects of varying colors and reflectivities and are less affected by lighting changes. The described prototype uses RGB-D sensors for bidirectional people counting in venues, aiding security and surveillance in spaces like stadiums or airports. It determines real-time occupancy and checks against maximum capacity, crucial during emergencies. The system includes a RealSense D415 depth camera and a mini-computer running object detection algorithms to count people and a 2D camera for identity verification. The system supports statistical analysis and uses C++, Python, and PHP with OpenCV for image processing, demonstrating a comprehensive approach to monitoring venue occupancy.
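The bidirectional counting logic can be sketched as virtual-line crossing on tracked centroid trajectories. The line test and track representation below are assumptions about the prototype, not its actual code.

```python
def count_crossings(tracks, line_y=0.0):
    """Count entries and exits as tracked centroids cross a virtual line;
    each track is a list of y-coordinates over time."""
    entries = exits = 0
    for ys in tracks:
        for prev, cur in zip(ys, ys[1:]):
            if prev < line_y <= cur:
                entries += 1
            elif prev >= line_y > cur:
                exits += 1
    return entries, exits
```

Real-time occupancy is then `entries - exits`, which the system would compare against the venue's maximum capacity.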
https://arxiv.org/abs/2403.12310
The robustness of unmanned aerial vehicle (UAV) tracking is crucial in many tasks like surveillance and robotics. Despite its importance, little attention has been paid to the performance of UAV trackers under common corruptions due to the lack of a dedicated platform. Addressing this, we propose UAV-C, a large-scale benchmark for assessing the robustness of UAV trackers under common corruptions. Specifically, UAV-C is built upon two popular UAV datasets by introducing 18 common corruptions from 4 representative categories, including adversarial, sensor, blur, and composite corruptions, at different severity levels. In total, UAV-C contains more than 10K sequences. To understand the robustness of existing UAV trackers against corruptions, we extensively evaluate 12 representative algorithms on UAV-C. Our study reveals several key findings: 1) Current trackers are vulnerable to corruptions, indicating that more attention is needed to enhance the robustness of UAV trackers; 2) When combined, composite corruptions result in more severe degradation of trackers; and 3) While each tracker has its unique performance profile, some trackers may be more sensitive to specific corruptions. By releasing UAV-C, we hope that it, along with our comprehensive analysis, serves as a valuable resource for advancing the robustness of UAV tracking against corruption. Our UAV-C will be available at this https URL.
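Corruption benchmarks of this kind typically synthesize each corruption at graded severity levels. A minimal illustrative generator for one sensor corruption (Gaussian noise, with an assumed sigma-per-level schedule that is not from the paper) might look like:

```python
import random

def gaussian_noise(pixels, severity, seed=0):
    """Apply zero-mean Gaussian noise whose sigma grows with severity level 1-5,
    clamping results to the valid 8-bit range (illustrative corruption generator)."""
    rng = random.Random(seed)       # seeded so corrupted sequences are reproducible
    sigma = 8.0 * severity          # assumed schedule: harsher noise at higher levels
    return [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in pixels]
```

Applying every corruption at every level to each source sequence is what multiplies two base datasets into the benchmark's 10K+ sequences.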
https://arxiv.org/abs/2403.11424
Swarm robots, which are inspired by the way insects behave collectively in order to achieve a common goal, have become a major part of research, with applications involving search and rescue, area exploration, surveillance, etc. In this paper, we present a swarm of robots that do not require individual extrinsic sensors to sense the environment but instead use a single central camera to locate and map the swarm. The robots can be easily built using readily available components, with the main chassis being 3D printed, making the system low-cost, low-maintenance, and easy to replicate. We describe Zutu's hardware and software architecture, the algorithms to map the robots to the real world, and some experiments conducted using four of our robots. Finally, we discuss possible applications of our system in research, education, and industry.
https://arxiv.org/abs/2403.11252
Reinforcement learning (RL)-based motion planning has recently shown the potential to outperform traditional approaches in tasks ranging from autonomous navigation to robot manipulation. In this work, we focus on a motion planning task for an evasive target in a partially observable multi-agent adversarial pursuit-evasion game (PEG). These pursuit-evasion problems are relevant to various applications, such as search and rescue operations and surveillance robots, where robots must effectively plan their actions to gather intelligence or accomplish mission tasks while avoiding detection or capture themselves. We propose a hierarchical architecture that integrates a high-level diffusion model to plan global paths responsive to environment data, while a low-level RL algorithm reasons about evasive versus global path-following behavior. Our approach outperforms baselines by 51.2% by leveraging the diffusion model to guide the RL algorithm toward more efficient exploration, and it improves explainability and predictability.
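The hierarchy described above can be caricatured in a few lines: follow the high-level (diffusion-planned) waypoint unless a pursuer comes within an evasion radius. This hand-coded rule only illustrates the division of labor between the two levels; in the paper the low-level policy is learned with RL, and the names and radius here are illustrative.

```python
def low_level_action(pos, waypoint, pursuer, evade_radius=2.0):
    """Return a unit step: evade the pursuer when it is close, otherwise
    follow the waypoint proposed by the high-level planner (toy sketch)."""
    dx, dy = pursuer[0] - pos[0], pursuer[1] - pos[1]
    dist = (dx * dx + dy * dy) ** 0.5
    if dist < evade_radius:
        return (-dx / (dist or 1.0), -dy / (dist or 1.0))  # evade: move away
    gx, gy = waypoint[0] - pos[0], waypoint[1] - pos[1]
    gnorm = (gx * gx + gy * gy) ** 0.5 or 1.0
    return (gx / gnorm, gy / gnorm)                        # follow the global path
```

Keeping global routing and local evasion separate is what makes the agent's behavior easier to explain and predict.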
https://arxiv.org/abs/2403.10794
Cloth-changing person re-identification aims to retrieve and identify specific pedestrians by using cloth-irrelevant features in person cloth-changing scenarios. However, pedestrian images captured by surveillance probes usually contain occlusions in real-world scenarios. The performance of existing cloth-changing re-identification methods is significantly degraded due to the reduction of discriminative cloth-irrelevant features caused by occlusion. We define cloth-changing person re-identification in occlusion scenarios as occluded cloth-changing person re-identification (Occ-CC-ReID), and to the best of our knowledge, we are the first to propose occluded cloth-changing person re-identification as a new task. We constructed two occluded cloth-changing person re-identification datasets for different occlusion scenarios: Occluded-PRCC and Occluded-LTCC. The datasets can be obtained from the following link: this https URL Re-Identification.
https://arxiv.org/abs/2403.08557