Current methods for 3D reconstruction and environmental mapping frequently struggle to achieve high precision, highlighting the need for practical and effective solutions. In response, our study introduces FlyNeRF, a system integrating Neural Radiance Fields (NeRF) with drone-based data acquisition for high-quality 3D reconstruction. An unmanned aerial vehicle (UAV) captures images together with their corresponding spatial coordinates, and the resulting data are used for an initial NeRF-based 3D reconstruction of the environment. Render quality is then assessed by an image evaluation neural network developed within the scope of our system. Based on the output of this evaluation module, an autonomous algorithm determines positions for additional image capture, thereby improving reconstruction quality. The neural network introduced for render quality assessment achieves an accuracy of 97%. Furthermore, our adaptive methodology enhances overall reconstruction quality, yielding an average improvement of 2.5 dB in Peak Signal-to-Noise Ratio (PSNR) for the 10% quantile. FlyNeRF demonstrates promising results, offering advancements in fields such as environmental monitoring, surveillance, and digital twins, where high-fidelity 3D reconstructions are crucial.
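The 2.5 dB figure refers to PSNR computed over the worst-rendered views. As a quick illustration of the metric (the PSNR formula is standard; the `low_quantile_mean` helper and its 10% default are illustrative, not part of FlyNeRF):

```python
import math

def psnr(reference, rendered, max_val=255.0):
    """Peak Signal-to-Noise Ratio between a reference image and a render.

    Both images are flat sequences of pixel intensities on the same scale.
    """
    mse = sum((r - t) ** 2 for r, t in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)

def low_quantile_mean(values, q=0.10):
    """Mean of the worst q-fraction of per-view PSNR scores.

    Reporting the gain on this tail statistic emphasises improving the
    weakest renders rather than the average view.
    """
    ranked = sorted(values)
    k = max(1, int(len(ranked) * q))
    return sum(ranked[:k]) / k
```

A 2.5 dB gain on the 10% quantile thus means the worst tenth of rendered views became markedly less noisy, even if the average view changed little.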
https://arxiv.org/abs/2404.12970
The Segment Anything Model (SAM) is a deep neural network foundation model designed for instance segmentation, which has gained significant popularity given its zero-shot segmentation ability. SAM operates by generating masks from various input prompts such as text, bounding boxes, points, or masks, introducing a novel methodology to overcome the constraints posed by the scarcity of dataset-specific training data. While SAM is trained on an extensive dataset comprising ~11M images, this data consists mostly of natural photographic images, with only very limited images from other modalities. While the rapid progress in visual infrared surveillance and X-ray security screening imaging technologies, driven forward by advances in deep learning, has significantly enhanced the ability to detect, classify and segment objects with high accuracy, it is not evident whether SAM's zero-shot capabilities transfer to such modalities. This work assesses SAM's capabilities in segmenting objects of interest in the X-ray/infrared modalities. Our approach reuses the pre-trained SAM with three different prompts: bounding box, centroid and random points. We present quantitative/qualitative results to showcase the performance on selected datasets. Our results show that SAM can segment objects in the X-ray modality when given a box prompt, but its performance varies for point prompts. Specifically, SAM performs poorly in segmenting slender objects and organic materials, such as plastic bottles. We find that infrared objects are also challenging to segment with point prompts given the low-contrast nature of this modality. This study shows that while SAM demonstrates outstanding zero-shot capabilities with box prompts, its performance ranges from moderate to poor for point prompts, indicating that special consideration of SAM's cross-modal generalisation is needed when considering its use on X-ray/infrared imagery.
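The three prompt styles can all be derived from a ground-truth mask, which is how such evaluations are typically set up before the prompts are fed to SAM's predictor. A minimal stdlib sketch (the function names and the mask-IoU evaluation are illustrative, not the paper's code):

```python
import random

def mask_to_prompts(mask, n_points=3, seed=0):
    """Derive the three prompt types from a binary ground-truth mask.

    Returns a bounding box (x0, y0, x1, y1), the mask centroid (x, y), and
    a few random foreground points -- the prompt styles compared in the study.
    """
    ys_xs = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    ys = [y for y, _ in ys_xs]
    xs = [x for _, x in ys_xs]
    box = (min(xs), min(ys), max(xs), max(ys))
    centroid = (sum(xs) / len(xs), sum(ys) / len(ys))
    points = random.Random(seed).sample(ys_xs, min(n_points, len(ys_xs)))
    return box, centroid, points

def mask_iou(pred, gt):
    """Intersection-over-union between two binary masks of equal shape,
    the usual score for comparing a predicted mask against ground truth."""
    inter = sum(p and g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p or g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return inter / union if union else 0.0
```

In practice each derived prompt would be handed to the pre-trained SAM and the resulting mask scored against the ground truth with `mask_iou`.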
https://arxiv.org/abs/2404.12285
Surveillance footage represents a valuable resource and an opportunity for conducting gait analysis. However, the typically low quality and high noise levels of such footage can severely impact the accuracy of pose estimation algorithms, which are foundational for reliable gait analysis. Existing literature suggests a direct correlation between the efficacy of pose estimation and the subsequent gait analysis results. A common mitigation strategy involves fine-tuning pose estimation models on noisy data to improve robustness. However, this approach may degrade the downstream model's performance on the original high-quality data, leading to a trade-off that is undesirable in practice. We propose a processing pipeline that incorporates a task-targeted artifact correction model specifically designed to pre-process and enhance surveillance footage before pose estimation. Our artifact correction model is optimized to work alongside a state-of-the-art pose estimation network, HRNet, without requiring repeated fine-tuning of the pose estimation model. Furthermore, we propose a simple and robust method for automatically obtaining pose-annotated low-quality videos for training the artifact correction model. We systematically evaluate our artifact correction model against a range of noisy surveillance data and demonstrate that our approach not only improves pose estimation on low-quality surveillance footage, but also preserves the integrity of pose estimation on high-resolution footage. Our experiments show a clear enhancement in gait analysis performance, supporting the viability of the proposed method as a superior alternative to direct fine-tuning strategies. Our contributions pave the way for more reliable gait analysis using surveillance data in real-world applications, regardless of data quality.
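The automatic annotation idea can be sketched as follows: degrade a clean, already pose-annotated frame and inherit its labels, producing paired training data for the artifact correction model. This is a hypothetical simplification (additive Gaussian noise stands in for real surveillance artifacts; parameter values are illustrative):

```python
import random

def make_noisy_training_pair(frame, poses, noise_std=8.0, seed=0):
    """Degrade a clean annotated frame and reuse its pose labels.

    Yields a (noisy frame, clean frame, poses) triplet: the correction model
    learns to map the noisy input back to the clean target, while the pose
    labels remain valid because the degradation does not move any pixels.
    """
    rng = random.Random(seed)
    noisy = [
        [min(255, max(0, int(px + rng.gauss(0, noise_std)))) for px in row]
        for row in frame
    ]
    return noisy, frame, poses
```

A real pipeline would emulate compression blocking, blur, and sensor noise typical of surveillance cameras rather than plain Gaussian noise.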
https://arxiv.org/abs/2404.12183
The new trend in the multi-object tracking task is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders progress. To address this challenge, we propose a high-quality yet low-cost data generation method based on Unreal Engine 5 and construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos, detailing the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions, and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object tracking framework called MLS-Track, in which the interaction between the model and the text is enhanced layer by layer through the introduction of a Semantic Guidance Module (SGM) and a Semantic Correlation Branch (SCB). Extensive experiments on the Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of our proposed framework, which achieves state-of-the-art performance. Code and datasets will be made available.
https://arxiv.org/abs/2404.12031
Motivated by the need to improve model performance in traffic monitoring tasks with limited labeled samples, we propose a straightforward augmentation technique tailored to object detection datasets, specifically designed for stationary camera-based applications. Our approach places augmented objects in the same positions as the originals to ensure its effectiveness. By applying in-place augmentation to objects from the same camera input image, we avoid overlap with the original and previously selected objects. Through extensive testing on two traffic monitoring datasets, we illustrate the efficacy of our augmentation strategy in improving model performance, particularly in scenarios with limited labeled samples and imbalanced class distributions. Notably, our method achieves performance comparable to models trained on the entire dataset while utilizing only 8.5 percent of the original data. Moreover, we report significant improvements on the FishEye8K dataset, with mAP@.5 increasing from 0.4798 to 0.5025 and mAP@.5:.95 rising from 0.29 to 0.3138. These results highlight the potential of our augmentation approach in enhancing object detection models for traffic monitoring applications.
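In-place augmentation hinges on the overlap check: a candidate object pasted at its original position is kept only if it does not collide with the original annotations or with earlier selections. A minimal sketch, assuming axis-aligned boxes and an illustrative IoU threshold (not the paper's exact value):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def select_inplace_augmentations(candidates, originals, iou_thresh=0.1):
    """Greedily keep candidate objects (pasted at their original positions)
    that overlap neither the original annotations nor earlier selections."""
    kept = []
    for box in candidates:
        if all(box_iou(box, other) <= iou_thresh for other in originals + kept):
            kept.append(box)
    return kept
```

Because a stationary camera fixes the scene geometry, pasting at the original position preserves perspective and lighting consistency for free.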
https://arxiv.org/abs/2404.11226
Human pose estimation faces hurdles in real-world applications due to factors like lighting changes, occlusions, and cluttered environments. We introduce a unique RGB-Thermal Nearly Paired and Annotated 2D Pose Dataset, comprising over 2,400 high-quality LWIR (thermal) images. Each image is meticulously annotated with 2D human poses, offering a valuable resource for researchers and practitioners. The dataset, captured from seven actors performing diverse everyday activities such as sitting, eating, and walking, facilitates pose estimation under occlusion and other challenging scenarios. We benchmark state-of-the-art pose estimation methods on the dataset to showcase its potential, establishing a strong baseline for future research. Our results demonstrate the dataset's effectiveness in promoting advancements in pose estimation for various applications, including surveillance, healthcare, and sports analytics. The dataset and code are available at this https URL
https://arxiv.org/abs/2404.10212
Traffic video description and analysis have received much attention recently due to the growing demand for efficient and reliable urban surveillance systems. Most existing methods focus only on locating traffic event segments, and severely lack descriptive details about the behaviour and context of the subjects of interest in those events. In this paper, we present TrafficVLM, a novel multi-modal dense video captioning model for the vehicle ego-camera view. TrafficVLM models traffic video events at different levels of analysis, both spatially and temporally, and generates long, fine-grained descriptions of the vehicles and pedestrians at different phases of an event. We also propose a conditional component for TrafficVLM to control the generation outputs, and a multi-task fine-tuning paradigm to enhance TrafficVLM's learning capability. Experiments show that TrafficVLM performs well on both vehicle and overhead camera views. Our solution achieved outstanding results in Track 2 of the AI City Challenge 2024, ranking third in the challenge standings. Our code is publicly available at this https URL.
https://arxiv.org/abs/2404.09275
Computer vision, particularly vehicle and pedestrian identification, is critical to the evolution of autonomous driving, artificial intelligence, and video surveillance. Current traffic monitoring systems confront major difficulties in recognizing small objects and pedestrians effectively in real time, posing a serious risk to public safety and contributing to traffic inefficiency. Recognizing these difficulties, our project focuses on the creation and validation of an advanced deep-learning framework capable of processing complex visual input for precise, real-time recognition of cars and people in a variety of environmental situations. We trained and evaluated different versions of the YOLOv8 and RT-DETR models on a dataset representing complicated urban settings. The YOLOv8 Large variant proved to be the most effective, especially in pedestrian recognition, with high precision and robustness. The results, which include Mean Average Precision and recall rates, demonstrate the model's ability to dramatically improve traffic monitoring and safety. This study makes an important addition to real-time, reliable detection in computer vision, establishing new benchmarks for traffic management systems.
https://arxiv.org/abs/2404.08081
Multi-robot target tracking finds extensive applications in scenarios such as environmental surveillance and wildfire management, which require robust practical deployment of multi-robot systems in uncertain and dangerous environments. Traditional approaches often focus on tracking accuracy while making no model or assumptions of the environment, neglecting potential environmental hazards that cause system failures in real-world deployments. To address this challenge, we investigate multi-robot target tracking in adversarial environments, considering sensing and communication attacks under uncertainty. We design specific strategies to avoid different danger zones and propose a multi-agent tracking framework for such perilous environments. We approximate the probabilistic constraints and formulate practical optimization strategies to address the computational challenges efficiently. We evaluate the performance of our proposed methods in simulation to demonstrate the robots' ability to adjust their risk-aware behaviors under different levels of environmental uncertainty and risk confidence. The proposed method is further validated via real-world robot experiments in which a team of drones successfully tracks dynamic ground robots while remaining aware of sensing and/or communication danger zones.
https://arxiv.org/abs/2404.07880
The illegal disposal of trash is a major public health and environmental concern, as disposing of trash in unplanned places poses serious health and environmental risks; trash disposal should be confined to public trash cans as much as possible. This research focuses on automating the penalization of litterbugs, addressing the persistent problem of littering in public places. Traditional approaches relying on manual intervention and witness reporting suffer from delays, inaccuracies, and anonymity issues. To overcome these challenges, this paper proposes a fully automated system that utilizes surveillance cameras and advanced computer vision algorithms for litter detection, object tracking, and face recognition. The system accurately identifies and tracks individuals engaged in littering, establishes their identities through face recognition, and enables efficient enforcement of anti-littering policies. By reducing reliance on manual intervention, minimizing human error, and providing prompt identification, the proposed system offers significant advantages in addressing littering incidents. The primary contribution of this research lies in the implementation of the proposed system, leveraging advanced technologies to enhance surveillance operations and automate the penalization of litterbugs.
https://arxiv.org/abs/2404.07467
Forecasting the short-term spread of an ongoing disease outbreak is a formidable challenge due to the complexity of the contributing factors, some of which can be characterized through interlinked, multi-modality variables such as epidemiological time series data, viral biology, population demographics, and the intersection of public policy and human behavior. Existing forecasting model frameworks struggle with the multifaceted nature of the relevant data and with robust translation of results, which hinders their performance and the provision of actionable insights for public health decision-makers. Our work introduces PandemicLLM, a novel framework with multi-modal Large Language Models (LLMs) that reformulates real-time forecasting of disease spread as a text reasoning problem, with the ability to incorporate real-time, complex, non-numerical information that was previously unattainable in traditional forecasting models. Through a unique AI-human cooperative prompt design and time series representation learning, this approach encodes multi-modal data for LLMs. The model is applied to the COVID-19 pandemic, trained to utilize textual public health policies, genomic surveillance, spatial, and epidemiological time series data, and subsequently tested across all 50 states of the U.S. Empirically, PandemicLLM is shown to be a high-performing pandemic forecasting framework that effectively captures the impact of emerging variants and can provide timely and accurate predictions. The proposed PandemicLLM opens avenues for incorporating various pandemic-related data in heterogeneous formats and exhibits performance benefits over existing models. This study illuminates the potential of adapting LLMs and representation learning to enhance pandemic forecasting, illustrating how AI innovations can strengthen pandemic responses and crisis management in the future.
https://arxiv.org/abs/2404.06962
This article presents a deep reinforcement learning-based approach to tackle a persistent surveillance mission requiring a single unmanned aerial vehicle initially stationed at a depot with fuel or time-of-flight constraints to repeatedly visit a set of targets with equal priority. Owing to the vehicle's fuel or time-of-flight constraints, the vehicle must be regularly refueled, or its battery must be recharged at the depot. The objective of the problem is to determine an optimal sequence of visits to the targets that minimizes the maximum time elapsed between successive visits to any target while ensuring that the vehicle never runs out of fuel or charge. We present a deep reinforcement learning algorithm to solve this problem and present the results of numerical experiments that corroborate the effectiveness of this approach in comparison with common-sense greedy heuristics.
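The common-sense greedy baseline mentioned above can be sketched under a simplifying star-topology assumption (each sortie flies depot → target → depot, travel time equals fuel burned, and the recharge time is an assumed constant, not the paper's setup):

```python
def greedy_patrol(dist_depot, fuel_capacity, visits, recharge_time=0.5):
    """Greedy baseline for the persistent-surveillance problem.

    The vehicle always heads for the most-overdue target, recharging at the
    depot first whenever the remaining fuel cannot cover the round trip.
    Returns the largest gap between successive visits to any target -- the
    quantity the learned policy aims to minimize.
    """
    last = [0.0] * len(dist_depot)        # time of each target's last visit
    t, fuel, worst = 0.0, fuel_capacity, 0.0
    for _ in range(visits):
        i = max(range(len(dist_depot)), key=lambda k: t - last[k])
        trip = 2 * dist_depot[i]          # out-and-back leg from the depot
        if fuel < trip:                   # recharge before departing
            fuel = fuel_capacity
            t += recharge_time
        t += trip
        fuel -= trip
        worst = max(worst, t - last[i])
        last[i] = t
    return worst
```

The RL approach in the paper instead learns the visit sequence directly, which lets it anticipate refuelling stops rather than reacting to them as this heuristic does.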
https://arxiv.org/abs/2404.06423
Human action or activity recognition in videos is a fundamental computer vision task with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction and many more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire. This work proposes a novel approach using cross-architecture pseudo-labeling with contrastive learning for semi-supervised action recognition. Our framework leverages both labeled and unlabeled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning for effective learning from both types of samples. We introduce a novel cross-architecture approach in which 3D Convolutional Neural Networks (3D CNNs) and video transformers (ViT) are utilised to capture different aspects of action representations; hence we call it ActNetFormer. The 3D CNNs excel at capturing spatial features and local dependencies in the temporal domain, while ViT excels at capturing long-range dependencies across frames. By integrating these complementary architectures within the ActNetFormer framework, our approach can effectively capture both local and global contextual information of an action. This comprehensive representation learning enables the model to achieve better performance in semi-supervised action recognition tasks by leveraging the strengths of each of these architectures. Experimental results on standard action recognition datasets demonstrate that our approach outperforms existing methods, achieving state-of-the-art performance with only a fraction of the labeled data. The official website of this work is available at: this https URL.
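One common form of cross-architecture pseudo-labeling keeps a pseudo-label for an unlabeled clip only when both models agree confidently. This sketch is a generic illustration of that idea, not ActNetFormer's exact mechanism; the threshold and dictionary layout are assumed:

```python
def agreed_pseudo_labels(cnn_probs, vit_probs, threshold=0.8):
    """Assign a pseudo-label to an unlabeled clip only when the 3D-CNN and
    the video transformer agree on the class and both are confident.

    Each argument maps clip ids to {class: probability} dictionaries.
    """
    labels = {}
    for clip_id in cnn_probs:
        c = max(cnn_probs[clip_id], key=cnn_probs[clip_id].get)
        v = max(vit_probs[clip_id], key=vit_probs[clip_id].get)
        if c == v and min(cnn_probs[clip_id][c], vit_probs[clip_id][v]) >= threshold:
            labels[clip_id] = c
    return labels
```

Clips that pass the agreement filter can then be treated as labeled samples in the next training round, while disagreements are left for the contrastive objective.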
https://arxiv.org/abs/2404.06243
This chapter explores the role of patent protection in algorithmic surveillance and whether ordre public exceptions from patentability should apply to such patents, due to their potential to enable human rights violations. It concludes that in most cases, it is undesirable to exclude algorithmic surveillance patents from patentability, as the patent system is ill-equipped to evaluate the impacts of the exploitation of such technologies. Furthermore, the disclosure of such patents has positive externalities from the societal perspective by opening the black box of surveillance for public scrutiny.
https://arxiv.org/abs/2404.05534
We introduce Dynamic Distinction Learning (DDL), a novel video anomaly detection methodology that combines pseudo-anomalies, dynamic anomaly weighting, and a distinction loss function to improve detection accuracy. By training on pseudo-anomalies, our approach adapts to the variability of normal and anomalous behaviors without fixed anomaly thresholds. Our model showcases superior performance on the Ped2, Avenue and ShanghaiTech datasets, where individual models are tailored for each scene. These achievements highlight DDL's effectiveness in advancing anomaly detection, offering a scalable and adaptable solution for video surveillance challenges.
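One plausible reading of a distinction-style objective, offered purely as an illustration and not the paper's exact formulation, minimizes reconstruction error on normal frames while a dynamically weighted hinge pushes pseudo-anomalous frames to reconstruct measurably worse:

```python
def distinction_loss(normal_errors, pseudo_errors, weight, margin=1.0):
    """Illustrative distinction-style objective (hypothetical formulation).

    Normal-frame reconstruction error is minimized directly; the hinge term
    activates only when pseudo-anomalies reconstruct too well, i.e. when
    their mean error is not at least `margin` above the normal mean. The
    `weight` factor stands in for the dynamic anomaly weighting.
    """
    normal = sum(normal_errors) / len(normal_errors)
    pseudo = sum(pseudo_errors) / len(pseudo_errors)
    return normal + weight * max(0.0, normal - pseudo + margin)
```

At test time, frames whose reconstruction error lands well above the normal range are flagged as anomalous, without a fixed global threshold baked into training.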
https://arxiv.org/abs/2404.04986
Flocking is a behavior where multiple agents in a system attempt to stay close to each other while avoiding collision and maintaining a desired formation. This is observed in the natural world and has applications in robotics, including natural disaster search and rescue, wild animal tracking, and perimeter surveillance and patrol. Recently, large language models (LLMs) have displayed an impressive ability to solve various collaboration tasks as individual decision-makers. Solving multi-agent flocking with LLMs would demonstrate their usefulness in situations requiring spatial and decentralized decision-making. Yet, when LLM-powered agents are tasked with implementing multi-agent flocking, they fall short of the desired behavior. After extensive testing, we find that agents with LLMs as individual decision-makers typically opt to converge on the average of their initial positions or diverge from each other. After breaking the problem down, we discover that LLMs cannot understand maintaining a shape or keeping a distance in a meaningful way. Solving multi-agent flocking with LLMs would enhance their ability to understand collaborative spatial reasoning and lay a foundation for addressing more complex multi-agent tasks. This paper discusses the challenges LLMs face in multi-agent flocking and suggests areas for future improvement and research.
https://arxiv.org/abs/2404.04752
Unmanned Aerial Vehicles (UAVs) are integral to various sectors like agriculture, surveillance, and logistics, driven by advancements in 5G. However, existing research lacks a comprehensive approach addressing both data freshness and security concerns. In this paper, we address the intricate challenges of data freshness and security, especially in the context of eavesdropping and jamming in modern UAV networks. Our framework incorporates exponential Age of Information (AoI) metrics and emphasizes the secrecy rate to tackle eavesdropping and jamming threats. We introduce a transformer-enhanced Deep Reinforcement Learning (DRL) approach to optimize the task offloading process. Comparative analysis with existing algorithms showcases the superiority of our scheme, indicating promising advancements in UAV network management.
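Age of Information (AoI) grows linearly between status updates and resets whenever a fresher update arrives; an exponential AoI metric penalizes staleness far more sharply than the linear age. A small sketch (the sampling scheme and the `alpha` parameter are assumptions for illustration, not the paper's exact formulation):

```python
import math

def exponential_aoi(update_times, horizon, alpha=0.1, dt=1.0):
    """Average exponential Age-of-Information sampled every `dt` seconds.

    The age t - last grows linearly between status updates and resets when
    an update arrives; exp(alpha * age) makes stale information dramatically
    more costly than the linear age alone.
    """
    pending = sorted(update_times)
    last = 0.0                       # timestamp of the freshest update so far
    total, steps, t = 0.0, 0, 0.0
    while t < horizon:
        while pending and pending[0] <= t:
            last = pending.pop(0)
        total += math.exp(alpha * (t - last))
        steps += 1
        t += dt
    return total / steps
```

Under such a metric, an offloading policy gains more by preventing long staleness gaps than by shaving the average delay, which is the motivation for using it in UAV task offloading.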
https://arxiv.org/abs/2404.04692
In the era of modern technology, object detection using the Gray Level Co-occurrence Matrix (GLCM) extraction method plays a crucial role in object recognition processes. It finds applications in real-time scenarios such as security surveillance and autonomous vehicle navigation, among others. Computational efficiency becomes a critical factor in achieving real-time object detection; hence, there is a need for a detection model with low complexity and satisfactory accuracy. This research aims to enhance computational efficiency by selecting appropriate features within the GLCM framework. Two classification models, K-Nearest Neighbours (K-NN) and Support Vector Machine (SVM), were employed, with the results indicating that K-NN outperforms SVM in terms of computational complexity. Specifically, K-NN, when utilizing a combination of Correlation, Energy, and Homogeneity features, achieves a 100% accuracy rate with low complexity. Moreover, when using a combination of Energy and Homogeneity features, K-NN attains an almost perfect accuracy of 99.9889%, while maintaining low complexity. On the other hand, although SVM achieves 100% accuracy with certain feature combinations, its high or very high complexity can pose challenges, particularly in real-time applications. Therefore, based on the trade-off between accuracy and complexity, the K-NN model with a combination of Correlation, Energy, and Homogeneity features emerges as the more suitable choice for real-time applications that demand high accuracy and low complexity. This research provides valuable insights for optimizing object detection in various applications requiring both high accuracy and rapid responsiveness.
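The three features in question are standard GLCM statistics. A compact stdlib sketch of building a symmetric, normalised GLCM for one pixel offset and extracting Energy, Homogeneity and Correlation from it (textbook definitions; the tiny-image usage is illustrative):

```python
def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalised gray-level co-occurrence matrix for one offset."""
    P = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < rows and 0 <= x2 < cols:
                i, j = image[y][x], image[y2][x2]
                P[i][j] += 1
                P[j][i] += 1           # count each pair in both directions
    total = sum(sum(row) for row in P)
    return [[v / total for v in row] for row in P]

def glcm_features(P):
    """Energy, Homogeneity and Correlation -- the feature combination the
    study found strongest for a low-complexity K-NN classifier."""
    n = len(P)
    cells = [(i, j) for i in range(n) for j in range(n)]
    energy = sum(P[i][j] ** 2 for i, j in cells)
    homogeneity = sum(P[i][j] / (1 + abs(i - j)) for i, j in cells)
    mu_i = sum(i * P[i][j] for i, j in cells)
    mu_j = sum(j * P[i][j] for i, j in cells)
    var_i = sum((i - mu_i) ** 2 * P[i][j] for i, j in cells)
    var_j = sum((j - mu_j) ** 2 * P[i][j] for i, j in cells)
    denom = (var_i * var_j) ** 0.5
    correlation = (sum((i - mu_i) * (j - mu_j) * P[i][j] for i, j in cells) / denom
                   if denom else 0.0)
    return energy, homogeneity, correlation
```

Each feature is a single scalar per offset, which is why a small feature vector like (Correlation, Energy, Homogeneity) keeps the downstream K-NN classifier cheap.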
https://arxiv.org/abs/2404.04578
Video-based visual relation detection tasks, such as video scene graph generation, play important roles in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder the progress of research in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, the relation types of existing datasets have relatively low-level semantics and can be often recognized by appearance or simple prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address this issue, we propose a new video visual relation detection task: video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball sports. 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
https://arxiv.org/abs/2404.04565
Video summarization is a crucial research area that aims to efficiently browse and retrieve relevant information from the vast amount of video content available today. With the exponential growth of multimedia data, the ability to extract meaningful representations from videos has become essential. Video summarization techniques automatically generate concise summaries by selecting keyframes, shots, or segments that capture the video's essence. This process improves the efficiency and accuracy of various applications, including video surveillance, education, entertainment, and social media. Despite the importance of video summarization, there is a lack of diverse and representative datasets, hindering comprehensive evaluation and benchmarking of algorithms. Existing evaluation metrics also fail to fully capture the complexities of video summarization, limiting accurate algorithm assessment and hindering the field's progress. To overcome data scarcity challenges and improve evaluation, we propose an unsupervised approach that leverages video data structure and information for generating informative summaries. By moving away from fixed annotations, our framework can produce representative summaries effectively. Moreover, we introduce an innovative evaluation pipeline tailored specifically for video summarization. Human participants are involved in the evaluation, comparing our generated summaries to ground truth summaries and assessing their informativeness. This human-centric approach provides valuable insights into the effectiveness of our proposed techniques. Experimental results demonstrate that our training-free framework outperforms existing unsupervised approaches and achieves competitive results compared to state-of-the-art supervised methods.
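A training-free selection step of the kind described can be illustrated with greedy farthest-point sampling over per-frame feature vectors, so the summary covers the video's visual variety without any annotation. This is a simplified stand-in for the paper's structure-driven approach, not its actual algorithm:

```python
def summarize(frames, k):
    """Pick k diverse keyframe indices by greedy farthest-point sampling.

    `frames` is a list of per-frame feature vectors. Starting from the first
    frame, each step adds the frame farthest (in Euclidean distance) from
    everything already selected, spreading the summary across the video's
    distinct visual content.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    chosen = [0]                      # seed with the first frame
    while len(chosen) < min(k, len(frames)):
        best = max((i for i in range(len(frames)) if i not in chosen),
                   key=lambda i: min(dist(frames[i], frames[j]) for j in chosen))
        chosen.append(best)
    return sorted(chosen)
```

In a real pipeline the feature vectors would come from a pretrained visual encoder; the selection itself needs no training, which is the sense in which such frameworks are training-free.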
https://arxiv.org/abs/2404.04564