Flocking is a behavior where multiple agents in a system attempt to stay close to each other while avoiding collision and maintaining a desired formation. This is observed in the natural world and has applications in robotics, including natural disaster search and rescue, wild animal tracking, and perimeter surveillance and patrol. Recently, large language models (LLMs) have displayed an impressive ability to solve various collaboration tasks as individual decision-makers. Solving multi-agent flocking with LLMs would demonstrate their usefulness in situations requiring spatial and decentralized decision-making. Yet, when LLM-powered agents are tasked with implementing multi-agent flocking, they fall short of the desired behavior. After extensive testing, we find that agents with LLMs as individual decision-makers typically opt to converge on the average of their initial positions or diverge from each other. After breaking the problem down, we discover that LLMs cannot understand maintaining a shape or keeping a distance in a meaningful way. Solving multi-agent flocking with LLMs would enhance their ability to understand collaborative spatial reasoning and lay a foundation for addressing more complex multi-agent tasks. This paper discusses the challenges LLMs face in multi-agent flocking and suggests areas for future improvement and research.
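The behavior the LLM agents fail to reproduce has a compact classical solution: Reynolds-style flocking, where each agent combines separation, cohesion, and alignment terms. A minimal sketch, not from the paper — the weights and radii are illustrative defaults:

```python
import math

def flock_step(positions, velocities, dt=0.1, d_sep=1.0,
               w_sep=1.5, w_coh=0.05, w_ali=0.1):
    """One synchronous Reynolds-style flocking update.

    Each agent steers away from neighbors closer than d_sep
    (separation), toward the group centroid (cohesion), and toward the
    mean velocity (alignment).  Weights are illustrative defaults.
    """
    n = len(positions)
    cx = sum(p[0] for p in positions) / n
    cy = sum(p[1] for p in positions) / n
    mvx = sum(v[0] for v in velocities) / n
    mvy = sum(v[1] for v in velocities) / n

    new_pos, new_vel = [], []
    for (x, y), (vx, vy) in zip(positions, velocities):
        sx = sy = 0.0
        for ox, oy in positions:
            d = math.hypot(x - ox, y - oy)
            if 0.0 < d < d_sep:           # too close: push away
                sx += (x - ox) / d
                sy += (y - oy) / d
        ax = w_sep * sx + w_coh * (cx - x) + w_ali * (mvx - vx)
        ay = w_sep * sy + w_coh * (cy - y) + w_ali * (mvy - vy)
        nvx, nvy = vx + dt * ax, vy + dt * ay
        new_pos.append((x + dt * nvx, y + dt * nvy))
        new_vel.append((nvx, nvy))
    return new_pos, new_vel

# Two agents starting too close spread out instead of collapsing.
pos = [(0.0, 0.0), (0.2, 0.0)]
vel = [(0.0, 0.0), (0.0, 0.0)]
for _ in range(200):
    pos, vel = flock_step(pos, vel)
spread = math.hypot(pos[1][0] - pos[0][0], pos[1][1] - pos[0][1])
```

Under these rules, agents that start too close move apart toward the separation radius rather than converging on their centroid — maintaining distance is exactly the behavior the paper reports LLM agents cannot reproduce.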
https://arxiv.org/abs/2404.04752
Unmanned Aerial Vehicles (UAVs) are integral in various sectors like agriculture, surveillance, and logistics, driven by advancements in 5G. However, existing research lacks a comprehensive approach addressing both data freshness and security concerns. In this paper, we address the intricate challenges of data freshness and security, especially in the context of eavesdropping and jamming in modern UAV networks. Our framework incorporates exponential Age of Information (AoI) metrics and emphasizes secrecy rate to tackle eavesdropping and jamming threats. We introduce a transformer-enhanced Deep Reinforcement Learning (DRL) approach to optimize task offloading processes. Comparative analysis with existing algorithms showcases the superiority of our scheme, indicating its promising advancements in UAV network management.
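As a point of reference for the freshness objective, the exponential Age-of-Information idea can be sketched as follows. This is an illustrative form of the metric, not necessarily the exact definition used in the paper:

```python
import math

def exp_aoi(update_times, t_now, kappa=0.5):
    """Exponential Age of Information at time t_now.

    Linear AoI is the time since the freshest delivered update was
    generated; the exponential variant exp(kappa * AoI) penalizes
    staleness increasingly harshly.  Illustrative form of the metric,
    not necessarily the paper's exact definition.
    """
    delivered = [t for t in update_times if t <= t_now]
    if not delivered:                 # nothing received yet: infinitely stale
        return math.inf
    return math.exp(kappa * (t_now - max(delivered)))

# Updates generated at t = 0, 2, 5; at t = 6 the freshest is 1 unit old.
staleness = exp_aoi([0.0, 2.0, 5.0], t_now=6.0)
```

The exponential weighting is what makes the metric useful under jamming: a jammer that blocks updates for even a short interval drives the penalty up sharply rather than linearly.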
https://arxiv.org/abs/2404.04692
In the era of modern technology, object detection using the Gray Level Co-occurrence Matrix (GLCM) extraction method plays a crucial role in object recognition processes. It finds applications in real-time scenarios such as security surveillance and autonomous vehicle navigation, among others. Computational efficiency becomes a critical factor in achieving real-time object detection. Hence, there is a need for a detection model with low complexity and satisfactory accuracy. This research aims to enhance computational efficiency by selecting appropriate features within the GLCM framework. Two classification models, namely K-Nearest Neighbours (K-NN) and Support Vector Machine (SVM), were employed, with the results indicating that K-NN outperforms SVM in terms of computational complexity. Specifically, K-NN, when utilizing a combination of Correlation, Energy, and Homogeneity features, achieves a 100% accuracy rate with low complexity. Moreover, when using a combination of Energy and Homogeneity features, K-NN attains an almost perfect accuracy level of 99.9889%, while maintaining low complexity. On the other hand, despite SVM achieving 100% accuracy in certain feature combinations, its high or very high complexity can pose challenges, particularly in real-time applications. Therefore, based on the trade-off between accuracy and complexity, the K-NN model with a combination of Correlation, Energy, and Homogeneity features emerges as a more suitable choice for real-time applications that demand high accuracy and low complexity. This research provides valuable insights for optimizing object detection in various applications requiring both high accuracy and rapid responsiveness.
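The three GLCM features named above have standard textbook definitions. A self-contained sketch (pure Python, small images, single pixel offset) of how they are computed from a co-occurrence matrix:

```python
def glcm_features(img, levels, dx=1, dy=0):
    """Energy, Homogeneity and Correlation from a Gray-Level
    Co-occurrence Matrix for a single pixel offset (dx, dy).

    img: 2D list of integer gray levels in [0, levels).  These are the
    standard definitions of the features whose accuracy/complexity
    trade-off the study compares.
    """
    h, w = len(img), len(img[0])
    # Count co-occurrences of gray-level pairs at the given offset.
    p = [[0.0] * levels for _ in range(levels)]
    total = 0
    for y in range(h - dy):
        for x in range(w - dx):
            p[img[y][x]][img[y + dy][x + dx]] += 1
            total += 1
    for i in range(levels):
        for j in range(levels):
            p[i][j] /= total            # normalize to a joint distribution

    pairs = [(i, j) for i in range(levels) for j in range(levels)]
    energy = sum(p[i][j] ** 2 for i, j in pairs)
    homogeneity = sum(p[i][j] / (1 + abs(i - j)) for i, j in pairs)

    mu_i = sum(i * p[i][j] for i, j in pairs)
    mu_j = sum(j * p[i][j] for i, j in pairs)
    var_i = sum((i - mu_i) ** 2 * p[i][j] for i, j in pairs)
    var_j = sum((j - mu_j) ** 2 * p[i][j] for i, j in pairs)
    if var_i == 0 or var_j == 0:        # constant marginal: correlation undefined
        correlation = 0.0
    else:
        correlation = sum((i - mu_i) * (j - mu_j) * p[i][j]
                          for i, j in pairs) / (var_i * var_j) ** 0.5
    return energy, homogeneity, correlation

# A vertical-stripe image: every horizontal pair is (0, 1).
energy, homogeneity, correlation = glcm_features([[0, 1], [0, 1]], levels=2)
```

Energy and Homogeneity need only one pass over the matrix, while Correlation also requires the marginal means and variances — one reason feature-subset choice affects the complexity the study measures.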
https://arxiv.org/abs/2404.04578
Video-based visual relation detection tasks, such as video scene graph generation, play important roles in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder the progress of research in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, the relation types of existing datasets have relatively low-level semantics and can be often recognized by appearance or simple prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address this issue, we propose a new video visual relation detection task: video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball sports. 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
https://arxiv.org/abs/2404.04565
Video summarization is a crucial research area that aims to efficiently browse and retrieve relevant information from the vast amount of video content available today. With the exponential growth of multimedia data, the ability to extract meaningful representations from videos has become essential. Video summarization techniques automatically generate concise summaries by selecting keyframes, shots, or segments that capture the video's essence. This process improves the efficiency and accuracy of various applications, including video surveillance, education, entertainment, and social media. Despite the importance of video summarization, there is a lack of diverse and representative datasets, hindering comprehensive evaluation and benchmarking of algorithms. Existing evaluation metrics also fail to fully capture the complexities of video summarization, limiting accurate algorithm assessment and hindering the field's progress. To overcome data scarcity challenges and improve evaluation, we propose an unsupervised approach that leverages video data structure and information for generating informative summaries. By moving away from fixed annotations, our framework can produce representative summaries effectively. Moreover, we introduce an innovative evaluation pipeline tailored specifically for video summarization. Human participants are involved in the evaluation, comparing our generated summaries to ground truth summaries and assessing their informativeness. This human-centric approach provides valuable insights into the effectiveness of our proposed techniques. Experimental results demonstrate that our training-free framework outperforms existing unsupervised approaches and achieves competitive results compared to state-of-the-art supervised methods.
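One common training-free way to produce a summary, in the spirit of the unsupervised framework described (though not the paper's actual pipeline), is greedy novelty-based keyframe selection over frame features:

```python
def select_keyframes(features, threshold=0.5):
    """Greedy novelty-based keyframe selection.

    features: one feature vector per frame.  A frame is kept when it
    lies farther than `threshold` (Euclidean) from the last kept
    keyframe, so static stretches collapse to a single frame.  This is
    a generic heuristic, not the paper's method.
    """
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    keyframes = [0]                     # always keep the first frame
    for i in range(1, len(features)):
        if dist(features[i], features[keyframes[-1]]) > threshold:
            keyframes.append(i)
    return keyframes

# Six frames with scene changes at frames 3 and 5.
summary = select_keyframes([[0.0], [0.1], [0.2], [1.0], [1.05], [2.0]])
```

Because selection depends only on the video's own feature structure, no fixed annotations are needed — the property the abstract highlights as the escape from data scarcity.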
https://arxiv.org/abs/2404.04564
Vestibular schwannomas (VS) are benign tumors that are generally managed by active surveillance with MRI examination. To further assist clinical decision-making and avoid overtreatment, an accurate prediction of tumor growth based on longitudinal imaging is highly desirable. In this paper, we introduce DeepGrowth, a deep learning method that incorporates neural fields and recurrent neural networks for prospective tumor growth prediction. In the proposed method, each tumor is represented as a signed distance function (SDF) conditioned on a low-dimensional latent code. Unlike previous studies that perform tumor shape prediction directly in the image space, we predict the latent codes instead and then reconstruct future shapes from it. To deal with irregular time intervals, we introduce a time-conditioned recurrent module based on a ConvLSTM and a novel temporal encoding strategy, which enables the proposed model to output varying tumor shapes over time. The experiments on an in-house longitudinal VS dataset showed that the proposed model significantly improved the performance ($\ge 1.6\%$ Dice score and $\ge0.20$ mm 95\% Hausdorff distance), in particular for top 20\% tumors that grow or shrink the most ($\ge 4.6\%$ Dice score and $\ge 0.73$ mm 95\% Hausdorff distance). Our code is available at: this https URL
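The SDF representation at the core of DeepGrowth has a simple sign convention worth making concrete. The sketch below uses a fixed sphere in place of the learned, latent-conditioned network:

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: negative inside, zero on the
    surface, positive outside.  DeepGrowth represents each tumor as
    such an SDF conditioned on a latent code; a fixed sphere stands in
    here only to show the convention."""
    return math.dist(point, center) - radius

# The tumor surface is the zero level set {x : sdf(x) = 0}; predicted
# growth corresponds to that level set moving outward between scans.
inside = sphere_sdf((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)
outside = sphere_sdf((2.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)
```

Predicting the low-dimensional latent code instead of the image then amounts to predicting which SDF to decode, rather than predicting every voxel.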
https://arxiv.org/abs/2404.02614
This study proposes a novel transfer learning framework for effective ship classification using high-resolution optical remote sensing satellite imagery. The framework is based on the deep convolutional neural network model ResNet50 and incorporates the Convolutional Block Attention Module (CBAM) to enhance performance. CBAM enables the model to attend to salient features in the images, allowing it to better discriminate between subtle differences between ships and backgrounds. Furthermore, this study adopts a transfer learning approach tailored for accurately classifying diverse types of ships by fine-tuning a pre-trained model for the specific task. Experimental results demonstrate the efficacy of the proposed framework in ship classification using optical remote sensing imagery, achieving a high classification accuracy of 94% across 5 classes, outperforming existing methods. This research holds potential applications in maritime surveillance and management, illegal fishing detection, and maritime traffic monitoring.
https://arxiv.org/abs/2404.02135
Human pose and shape (HPS) estimation with lensless imaging is not only beneficial to privacy protection but also can be used in covert surveillance scenarios due to the small size and simple structure of this device. However, this task presents significant challenges due to the inherent ambiguity of the captured measurements and the lack of effective methods for directly estimating human pose and shape from lensless data. In this paper, we propose, to our knowledge, the first end-to-end framework to recover 3D human poses and shapes from lensless measurements. We specifically design a multi-scale lensless feature decoder to decode the lensless measurements through the optically encoded mask for efficient feature extraction. We also propose a double-head auxiliary supervision mechanism to improve the estimation accuracy of human limb ends. Besides, we establish a lensless imaging system and verify the effectiveness of our method on various datasets acquired by our lensless imaging system.
https://arxiv.org/abs/2404.01941
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods are prone to be domain-specific, thus being costly for practical deployment as any domain change will involve data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. With the textual scene description, we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.
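The temporal-aggregation step can be illustrated independently of the LLM: given per-frame anomaly scores, aggregation damps isolated spikes while preserving sustained anomalies. A moving-average stand-in, not the paper's LLM-based mechanism:

```python
def aggregate_scores(scores, window=3):
    """Centered moving average over per-frame anomaly scores.

    LAVAD performs temporal aggregation by prompting an LLM; this
    numeric stand-in shows the intended effect: isolated spikes are
    damped while sustained anomalies keep a high score.
    """
    out = []
    for i in range(len(scores)):
        lo = max(0, i - window // 2)
        hi = min(len(scores), i + window // 2 + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

# A one-frame spike is damped; a three-frame anomaly is preserved.
spike = aggregate_scores([0, 0, 1, 0, 0])
burst = aggregate_scores([0, 1, 1, 1, 0])
```

The cross-modal caption cleaning plays an analogous role on the input side: both steps trade frame-level noise for temporally coherent scores.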
https://arxiv.org/abs/2404.01014
Unsupervised (US) video anomaly detection (VAD) in surveillance applications is gaining more popularity recently due to its practical real-world applications. As surveillance videos are privacy sensitive and the availability of large-scale video data may enable better US-VAD systems, collaborative learning can be highly rewarding in this setting. However, due to the extremely challenging nature of the US-VAD task, where learning is carried out without any annotations, privacy-preserving collaborative learning of US-VAD systems has not been studied yet. In this paper, we propose a new baseline for anomaly detection capable of localizing anomalous events in complex surveillance videos in a fully unsupervised fashion without any labels on a privacy-preserving participant-based distributed training configuration. Additionally, we propose three new evaluation protocols to benchmark anomaly detection approaches on various scenarios of collaborations and data availability. Based on these protocols, we modify existing VAD datasets to extensively evaluate our approach as well as existing US SOTA methods on two large-scale datasets including UCF-Crime and XD-Violence. All proposed evaluation protocols, dataset splits, and codes are available here: this https URL
https://arxiv.org/abs/2404.00847
This research introduces an innovative security enhancement approach, employing advanced image analysis and soft computing. The focus is on an intelligent surveillance system that detects unauthorized individuals in restricted areas by analyzing attire. Traditional security measures face challenges in monitoring unauthorized access. Leveraging YOLOv8, an advanced object detection algorithm, our system identifies authorized personnel based on their attire in CCTV footage. The methodology involves training the YOLOv8 model on a comprehensive dataset of uniform patterns, ensuring precise recognition in specific regions. Soft computing techniques enhance adaptability to dynamic environments and varying lighting conditions. This research contributes to image analysis and soft computing, providing a sophisticated security solution. Emphasizing uniform-based anomaly detection, it establishes a foundation for robust security systems in restricted areas. The outcomes highlight the potential of YOLOv8-based surveillance in ensuring safety in sensitive locations.
https://arxiv.org/abs/2404.00645
Classifying hyperspectral images is a difficult task in remote sensing due to their complex high-dimensional data. To address this challenge, we propose HSIMamba, a novel framework that uses bidirectional reversed convolutional neural network pathways to extract spectral features more efficiently. Additionally, it incorporates a specialized block for spatial analysis. Our approach combines the operational efficiency of CNNs with the dynamic feature extraction capability of attention mechanisms found in Transformers. However, it avoids the associated high computational demands. HSIMamba is designed to process data bidirectionally, significantly enhancing the extraction of spectral features and integrating them with spatial information for comprehensive analysis. This approach improves classification accuracy beyond current benchmarks and addresses computational inefficiencies encountered with advanced models like Transformers. HSIMamba was tested against three widely recognized datasets (Houston 2013, Indian Pines, and Pavia University) and demonstrated exceptional performance, surpassing existing state-of-the-art models in HSI classification. This method highlights the methodological innovation of HSIMamba and its practical implications, which are particularly valuable in contexts where computational resources are limited. HSIMamba redefines the standards of efficiency and accuracy in HSI classification, thereby enhancing the capabilities of remote sensing applications. Hyperspectral imaging has become a crucial tool for environmental surveillance, agriculture, and other critical areas that require detailed analysis of the Earth's surface. Please see our code in HSIMamba for more details.
https://arxiv.org/abs/2404.00272
Head pose estimation has become a crucial area of research in computer vision given its usefulness in a wide range of applications, including robotics, surveillance, or driver attention monitoring. One of the most difficult challenges in this field is managing head occlusions that frequently take place in real-world scenarios. In this paper, we propose a novel and efficient framework that is robust in real world head occlusion scenarios. In particular, we propose an unsupervised latent embedding clustering with regression and classification components for each pose angle. The model optimizes latent feature representations for occluded and non-occluded images through a clustering term while improving fine-grained angle predictions. Experimental evaluation on in-the-wild head pose benchmark datasets reveals competitive performance in comparison to state-of-the-art methodologies, with the advantage of a significant reduction in data requirements. We observe a substantial improvement in occluded head pose estimation. Also, an ablation study is conducted to ascertain the impact of the clustering term within our proposed framework.
https://arxiv.org/abs/2403.20251
Multi-target multi-camera tracking is a crucial task that involves identifying and tracking individuals over time using video streams from multiple cameras. This task has practical applications in various fields, such as visual surveillance, crowd behavior analysis, and anomaly detection. However, due to the difficulty and cost of collecting and labeling data, existing datasets for this task are either synthetically generated or artificially constructed within a controlled camera network setting, which limits their ability to model real-world dynamics and generalize to diverse camera configurations. To address this issue, we present MTMMC, a real-world, large-scale dataset that includes long video sequences captured by 16 multi-modal cameras in two different environments - campus and factory - across various time, weather, and season conditions. This dataset provides a challenging test-bed for studying multi-camera tracking under diverse real-world complexities and includes an additional input modality of spatially aligned and temporally synchronized RGB and thermal cameras, which enhances the accuracy of multi-camera tracking. MTMMC is a super-set of existing datasets, benefiting independent fields such as person detection, re-identification, and multiple object tracking. We provide baselines and new learning setups on this dataset and set the reference scores for future studies. The datasets, models, and test server will be made publicly available.
https://arxiv.org/abs/2403.20225
Because existing metro video surveillance systems have not been able to effectively solve the metro crowd density estimation problem, a Metro Crowd density estimation Network (called MCNet) is proposed to automatically classify the crowd density level of passengers. First, an Integrating Multi-scale Attention (IMA) module is proposed to enhance the ability of plain classifiers to extract semantic crowd texture features, accommodating the characteristics of crowd texture. The innovation of the IMA module is to fuse dilated convolution, multi-scale feature extraction, and an attention mechanism to obtain multi-scale crowd feature activations from a larger receptive field at lower computational cost, and to strengthen the crowd activation state of convolutional features in the top layers. Second, a novel lightweight crowd texture feature extraction network is proposed that can directly process video frames and automatically extract texture features for crowd density estimation; its faster image processing speed and fewer network parameters make it flexible to deploy on embedded platforms with limited hardware resources. Finally, this paper integrates the IMA module and the lightweight crowd texture feature extraction network to construct MCNet, and validates the network on an image classification dataset (CIFAR-10) and four crowd density datasets (PETS2009, Mall, QUT, and SH_METRO) to assess whether MCNet is a suitable solution for crowd density estimation in metro video surveillance, where image processing challenges include high density, high occlusion, perspective distortion, and limited hardware resources.
https://arxiv.org/abs/2403.20173
The Space Domain Awareness (SDA) community routinely tracks satellites in orbit by fitting an orbital state to observations made by the Space Surveillance Network (SSN). In order to fit such orbits, an accurate model of the forces that are acting on the satellite is required. Over the past several decades, high-quality, physics-based models have been developed for satellite state estimation and propagation. These models are exceedingly good at estimating and propagating orbital states for non-maneuvering satellites; however, there are several classes of anomalous accelerations that a satellite might experience which are not well-modeled, such as satellites that use low-thrust electric propulsion to modify their orbit. Physics-Informed Neural Networks (PINNs) are a valuable tool for these classes of satellites as they combine physics models with Deep Neural Networks (DNNs), which are highly expressive and versatile function approximators. By combining a physics model with a DNN, the machine learning model need not learn astrodynamics, which results in more efficient and effective utilization of machine learning resources. This paper details the application of PINNs to estimate the orbital state and a continuous, low-amplitude anomalous acceleration profile for satellites. The PINN is trained to learn the unknown acceleration by minimizing the mean square error of observations. We evaluate the performance of pure physics models with PINNs in terms of their observation residuals and their propagation accuracy beyond the fit span of the observations. For a two-day simulation of a GEO satellite using an unmodeled acceleration profile on the order of $10^{-8} \text{ km/s}^2$, the PINN outperformed the best-fit physics model by orders of magnitude for both observation residuals (123 arcsec vs 1.00 arcsec) as well as propagation accuracy (3860 km vs 164 km after five days).
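The scale of the problem is easy to reproduce with a toy propagator: integrating two-body dynamics with and without a constant 1e-8 km/s^2 along-track term shows kilometer-scale drift of the kind the PINN must absorb. A rough planar sketch with semi-implicit Euler, not the paper's high-fidelity physics model:

```python
import math

MU = 398600.4418          # Earth's gravitational parameter, km^3/s^2

def propagate(r, v, a_extra, dt, steps):
    """Planar two-body propagation (semi-implicit Euler) plus a
    constant along-track anomalous acceleration a_extra in km/s^2 --
    a toy version of the unmodeled term the PINN is trained to absorb.
    """
    (x, y), (vx, vy) = r, v
    for _ in range(steps):
        rn = math.hypot(x, y)
        vn = math.hypot(vx, vy)
        ax = -MU * x / rn**3 + a_extra * vx / vn   # gravity + thrust
        ay = -MU * y / rn**3 + a_extra * vy / vn
        vx += ax * dt
        vy += ay * dt
        x += vx * dt                                # update with new velocity
        y += vy * dt
    return x, y

# GEO-like circular orbit, two days in 60 s steps, 1e-8 km/s^2 thrust
# (the order of magnitude quoted in the abstract).
r0, v0 = (42164.0, 0.0), (0.0, math.sqrt(MU / 42164.0))
dt, steps = 60.0, 2 * 24 * 60
clean = propagate(r0, v0, 0.0, dt, steps)
anom = propagate(r0, v0, 1e-8, dt, steps)
drift = math.hypot(anom[0] - clean[0], anom[1] - clean[1])
```

Even this tiny acceleration accumulates into a position error of many kilometers over two days, which is why a pure physics model fit to the observations cannot track such a satellite.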
https://arxiv.org/abs/2403.19736
Robotic access monitoring of multiple target areas has applications including checkpoint enforcement, surveillance and containment of fire and flood hazards. Monitoring access for a single target region has been successfully modeled as a minimum-cut problem. We generalize this model to support multiple target areas using two approaches: iterating on individual targets and examining the collections of targets holistically. Through simulation we measure the performance of each approach on different scenarios.
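The minimum-cut model referenced above reduces to a max-flow computation by the max-flow min-cut theorem. A self-contained Edmonds-Karp sketch on a toy directed graph — the encoding is illustrative, not the paper's construction:

```python
from collections import deque

def min_cut_value(capacity, s, t):
    """Value of the minimum s-t cut (= maximum flow), computed with
    Edmonds-Karp BFS augmentation.

    capacity: dict {u: {v: cap}} for a directed graph.  In an access-
    monitoring model, edge capacities can encode how costly a passage
    is to cover, and the min cut is the cheapest set of chokepoints.
    """
    # Residual graph with explicit zero-capacity reverse edges.
    res = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for a shortest augmenting path from s to t.
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow                 # no augmenting path left
        # Collect the path edges and their bottleneck capacity.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= aug
            res[v][u] += aug
        flow += aug

# Two routes from s to t; the min cut is {a->t, s->b} with value 4.
cap = {'s': {'a': 3, 'b': 2}, 'a': {'t': 2}, 'b': {'t': 3}}
```

The iterative multi-target approach described in the abstract would solve one such cut per target region, while the holistic approach reasons about the targets jointly.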
https://arxiv.org/abs/2403.19375
Video Anomaly Detection (VAD), aiming to identify abnormalities within a specific context and timeframe, is crucial for intelligent Video Surveillance Systems. While recent deep learning-based VAD models have shown promising results by generating high-resolution frames, they often lack competence in preserving detailed spatial and temporal coherence in video frames. To tackle this issue, we propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task. Specifically, we introduce a two-branch vision transformer network designed to capture deep visual features of video frames, addressing spatial and temporal dimensions responsible for modeling appearance and motion patterns, respectively. The inter-patch relationship in each dimension is decoupled into inter-patch similarity and the order information of each patch. To mitigate memory consumption, we convert the order information prediction task into a multi-label learning problem, and the inter-patch similarity prediction task into a distance matrix regression problem. Comprehensive experiments demonstrate the effectiveness of our method, surpassing pixel-generation-based methods by a significant margin across three public benchmarks. Additionally, our approach outperforms other self-supervised learning-based methods.
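The distance-matrix regression target can be made concrete: given flattened patches, the network regresses their pairwise distances. An illustrative construction, not the paper's exact recipe:

```python
def patch_distance_matrix(patches):
    """Pairwise Euclidean distances between flattened patches -- the
    kind of regression target the inter-patch similarity task uses
    (illustrative construction, not the paper's exact recipe)."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    n = len(patches)
    return [[dist(patches[i], patches[j]) for j in range(n)]
            for i in range(n)]

# Identical patches sit at distance 0; the matrix is symmetric with a
# zero diagonal, which makes it a well-behaved regression target.
D = patch_distance_matrix([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
```

Regressing one n-by-n matrix is cheaper than predicting pixels for every patch pair, which is the memory saving the abstract claims for this reformulation.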
https://arxiv.org/abs/2403.19111
Medical imaging is critical to the diagnosis, surveillance, and treatment of many health conditions, including oncological, neurological, cardiovascular, and musculoskeletal disorders, among others. Radiologists interpret these complex, unstructured images and articulate their assessments through narrative reports that remain largely unstructured. This unstructured narrative must be converted into a structured semantic representation to facilitate secondary applications such as retrospective analyses or clinical decision support. Here, we introduce the Corpus of Annotated Medical Imaging Reports (CAMIR), which includes 609 annotated radiology reports from three imaging modality types: Computed Tomography, Magnetic Resonance Imaging, and Positron Emission Tomography-Computed Tomography. Reports were annotated using an event-based schema that captures clinical indications, lesions, and medical problems. Each event consists of a trigger and multiple arguments, and a majority of the argument types, including anatomy, normalize the spans to pre-defined concepts to facilitate secondary use. CAMIR uniquely combines a granular event structure and concept normalization. To extract CAMIR events, we explored two BERT (Bi-directional Encoder Representation from Transformers)-based architectures, including an existing architecture (mSpERT) that jointly extracts all event information and a multi-step approach (PL-Marker++) that we augmented for the CAMIR schema.
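The event schema (a trigger plus typed arguments, with some argument types normalized to pre-defined concepts) can be sketched as plain data structures. Field and value names below are illustrative, not the released CAMIR schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Argument:
    role: str                       # argument type, e.g. Anatomy, Size
    text: str                       # the span as written in the report
    concept: Optional[str] = None   # normalized pre-defined concept, if any

@dataclass
class Event:
    trigger: str                    # the span that anchors the event
    event_type: str                 # e.g. Lesion, Medical Problem, Indication
    arguments: List[Argument] = field(default_factory=list)

# A hypothetical lesion event from a CT report sentence.
lesion = Event(
    trigger="nodule",
    event_type="Lesion",
    arguments=[
        Argument(role="Anatomy", text="right upper lobe", concept="lung"),
        Argument(role="Size", text="1.2 cm"),
    ],
)
```

Normalizing the anatomy span to a fixed concept is what makes retrospective queries ("all lung lesions over 1 cm") possible without re-parsing free text.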
https://arxiv.org/abs/2403.18975
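The trigger-plus-arguments event structure described above can be illustrated with a minimal data model. The event and argument type names below are hypothetical examples in the spirit of the CAMIR schema, not its exact definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Argument:
    """A typed argument span; many argument types also normalize the span to a concept."""
    arg_type: str                  # e.g. "Anatomy" (illustrative type name)
    span: str                      # verbatim text from the report
    concept: Optional[str] = None  # normalized concept, when the type defines one

@dataclass
class Event:
    """One annotated event: a trigger plus its arguments."""
    event_type: str                # e.g. "Lesion" (illustrative type name)
    trigger: str
    arguments: List[Argument] = field(default_factory=list)

# Hypothetical annotation for the phrase "2.3 cm mass in the right upper lobe".
ev = Event(
    event_type="Lesion",
    trigger="mass",
    arguments=[
        Argument("Anatomy", "right upper lobe", concept="lung"),
        Argument("Size", "2.3 cm"),
    ],
)
assert ev.trigger == "mass"
assert any(a.concept == "lung" for a in ev.arguments)
```

Keeping the verbatim span and the normalized concept side by side is what lets the corpus support both fine-grained extraction and concept-level retrospective queries.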
In recent years, remarkable advancements have been achieved in the field of image generation, primarily driven by the escalating demand for high-quality outcomes across various image generation subtasks, such as inpainting, denoising, and super resolution. A major effort is devoted to exploring the application of super-resolution techniques to enhance the quality of low-resolution images. In this context, our method explores in depth the problem of ship image super resolution, which is crucial for coastal and port surveillance. We investigate the opportunity offered by the growing interest in text-to-image diffusion models, taking advantage of the prior knowledge that such foundation models have already learned. In particular, we present a diffusion-model-based architecture that leverages text conditioning during training while being class-aware, to best preserve the crucial details of the ships during the generation of the super-resolved image. Given the specificity of this task and the scarce availability of off-the-shelf data, we also introduce a large labeled ship dataset scraped from online ship images, mostly from the ShipSpotting\footnote{\url{this http URL}} website. Our method achieves more robust results than other deep learning models previously employed for super resolution, as shown by multiple experiments. Moreover, we investigate how this model can benefit downstream tasks, such as classification and object detection, thus emphasizing practical implementation in a real-world scenario. Experimental results show the flexibility, reliability, and impressive performance of the proposed framework over state-of-the-art methods for different tasks. The code is available at: this https URL .
https://arxiv.org/abs/2403.18370
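Class-aware text conditioning of the kind described above can be sketched as prompt construction plus the standard classifier-free guidance combination used by text-to-image diffusion models. Both the prompt template and the use of classifier-free guidance here are assumptions for illustration, since the abstract does not specify either.

```python
import numpy as np

def build_prompt(ship_class: str) -> str:
    """Compose a class-aware text prompt for conditioning (template is illustrative)."""
    return f"a high-resolution photo of a {ship_class} ship"

def classifier_free_guidance(eps_uncond: np.ndarray,
                             eps_cond: np.ndarray,
                             scale: float) -> np.ndarray:
    """Standard classifier-free guidance: steer the unconditional noise
    prediction toward the text-conditioned one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# scale = 1 recovers the conditional prediction; scale = 0 the unconditional one.
u, c = np.zeros(4), np.ones(4)
assert np.allclose(classifier_free_guidance(u, c, 1.0), c)
assert np.allclose(classifier_free_guidance(u, c, 0.0), u)
```

Injecting the class label into the prompt is one simple way to make a text-conditioned diffusion model class-aware at sampling time without changing its architecture.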