While tactile sensing is widely accepted as an important and useful sensing modality, its use pales in comparison to other sensory modalities like vision and proprioception. AnySkin addresses the critical challenges that impede the use of tactile sensing -- versatility, replaceability, and data reusability. Building on the simple design of ReSkin and decoupling the sensing electronics from the sensing interface, AnySkin simplifies integration, making it as straightforward as putting on a phone case and connecting a charger. Furthermore, AnySkin is the first uncalibrated tactile sensor with cross-instance generalizability of learned manipulation policies. To summarize, this work makes three key contributions: first, we introduce a streamlined fabrication process and a design tool for creating an adhesive-free, durable, and easily replaceable magnetic tactile sensor; second, we characterize slip detection and policy learning with the AnySkin sensor; and third, we demonstrate zero-shot generalization of models trained on one instance of AnySkin to new instances and compare it with popular existing tactile solutions like DIGIT and ReSkin. this https URL
https://arxiv.org/abs/2409.08276
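As a rough illustration of the slip-detection task characterized above, here is a minimal sketch that flags slip when the short-term change in a magnetometer stream spikes; the 15-channel layout (five 3-axis magnetometers, as in ReSkin-style skins), window, and threshold are assumptions for illustration, not details from the paper:

```python
import numpy as np

def detect_slip(readings: np.ndarray, window: int = 5, threshold: float = 0.8) -> bool:
    """Flag slip when the short-term change in the magnetometer signal
    exceeds a threshold.

    readings: (T, 15) array -- e.g. five 3-axis magnetometers. The window
    and threshold values here are illustrative, not from the paper.
    """
    if len(readings) < window + 1:
        return False
    recent = readings[-(window + 1):]
    # Frame-to-frame signal deltas over the recent window
    deltas = np.linalg.norm(np.diff(recent, axis=0), axis=1)
    return bool(deltas.mean() > threshold)
```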
Alzheimer's Disease (AD) is an incurable, progressive neurodegenerative disorder that affects the human brain, leading to a decline in memory and cognitive abilities and, eventually, the ability to carry out daily tasks. Manual diagnosis of Alzheimer's disease from MRI images suffers from low sensitivity and is a very tedious process for neurologists. Therefore, there is a need for an automatic Computer Assisted Diagnosis (CAD) system that can detect AD at early stages with higher accuracy. In this research, we propose a novel AD-Lite Net model (trained from scratch) that alleviates the aforementioned problems. The novelties of this research are: (I) we propose a very lightweight CNN model by incorporating Depthwise Separable Convolutional (DWSC) layers and Global Average Pooling (GAP) layers; (II) we leverage a ``parallel concatenation block'' (pcb) in the proposed AD-Lite Net model. This pcb consists of a transformation layer (Tx-layer) followed by two convolutional layers, which are then concatenated with the original base model. The Tx-layer converts the features into highly distinctive features that are imperative for detecting Alzheimer's disease. As a consequence, the proposed AD-Lite Net model with ``parallel concatenation'' converges faster and automatically mitigates the class imbalance problem in MRI datasets in a very generalized way. To validate our proposed model, we evaluated it on three different MRI datasets. Furthermore, we combined the ADNI and AD datasets and subsequently performed a 10-fold cross-validation experiment to verify the model's generalization ability. Extensive experimental results show that our proposed model outperforms all existing CNN models, as well as a recent Vision Transformer (ViT) model, by a significant margin.
https://arxiv.org/abs/2409.08170
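A hedged PyTorch sketch of the building blocks named above (depthwise separable convolutions, global average pooling, and a parallel concatenation block). The Tx-layer internals, channel sizes, and single-channel MRI input are guesses for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class DWSConv(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 + pointwise 1x1."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)
    def forward(self, x):
        return torch.relu(self.pointwise(self.depthwise(x)))

class ParallelConcatBlock(nn.Module):
    """Sketch of the 'pcb': a Tx-layer followed by two conv layers, then
    concatenated channel-wise with the base features. The Tx-layer here
    (BatchNorm + ReLU) is a guess; the paper does not fully specify it."""
    def __init__(self, c):
        super().__init__()
        self.tx = nn.Sequential(nn.BatchNorm2d(c), nn.ReLU())
        self.convs = nn.Sequential(DWSConv(c, c), DWSConv(c, c))
    def forward(self, x):
        return torch.cat([x, self.convs(self.tx(x))], dim=1)

class ADLiteNetSketch(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.stem = DWSConv(1, 32)           # grayscale MRI slice in (assumed)
        self.pcb = ParallelConcatBlock(32)   # 32 -> 64 channels after concat
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.head = nn.Linear(64, n_classes)
    def forward(self, x):
        z = self.gap(self.pcb(self.stem(x))).flatten(1)
        return self.head(z)
```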
In this paper, we present a novel approach to the development and deployment of an autonomous mosquito-breeding-place detector rover with object and obstacle detection capabilities for mosquito control. Mosquito-borne diseases continue to pose significant health threats globally, with conventional control methods proving slow and inefficient. Amid rising concerns over the rapid spread of these diseases, there is an urgent need for innovative and efficient strategies to manage mosquito populations and prevent disease transmission. To mitigate the limitations of manual labor and traditional methods, our rover employs autonomous control strategies. Leveraging our own custom dataset, the rover can autonomously navigate along a pre-defined path, identifying potential breeding grounds with precision. It then eliminates these breeding grounds by spraying a chemical agent, effectively eradicating mosquito habitats. Our project demonstrates an effectiveness in controlling mosquitoes and safeguarding public health that traditional methods lack. The code for this project is available on GitHub at this https URL
https://arxiv.org/abs/2409.08078
RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies have paid insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly impacts the attention representation, leading to prediction errors caused by attention-shift issues. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. Firstly, we introduce Depth Spatial-Aware Optimization (Depth SAO) as an offset to represent real-world spatial relationships. Secondly, the similarity in the feature space of RGB-D is learned by Depth Linear Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level. Finally, an MLP decoder is utilized to effectively fuse multi-scale features to meet real-time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer significantly addresses the issue of attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI (97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes (83.4% mIoU) datasets.
https://arxiv.org/abs/2409.07995
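A minimal sketch of what linear cross-attention from RGB queries to depth keys/values could look like, using the common elu(x)+1 kernel so the cost stays linear in the number of pixels; the paper's actual Depth LCA formulation may differ:

```python
import torch
import torch.nn.functional as F

def depth_linear_cross_attention(rgb_feats, depth_feats):
    """Cross-attention from RGB queries to depth keys/values with a
    linear-attention kernel. Shapes: (B, N, C) with N = H*W flattened
    pixels. This is an illustrative stand-in, not the paper's exact layer."""
    q = F.elu(rgb_feats) + 1          # (B, N, C) kernelized queries
    k = F.elu(depth_feats) + 1        # (B, N, C) kernelized keys
    v = depth_feats                   # (B, N, C) values
    kv = torch.einsum('bnc,bnd->bcd', k, v)                            # (B, C, C)
    z = 1.0 / (q @ k.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6)  # (B, N, 1)
    return (q @ kv) * z               # (B, N, C), linear in N
```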
We present Sparse R-CNN OBB, a novel framework for the detection of oriented objects in SAR images leveraging sparse learnable proposals. Sparse R-CNN OBB has a streamlined architecture and is easy to train, as it utilizes a sparse set of 300 proposals instead of training a proposal generator on hundreds of thousands of anchors. To the best of our knowledge, Sparse R-CNN OBB is the first model to adopt the concept of sparse learnable proposals for the detection of oriented objects, as well as for the detection of ships in Synthetic Aperture Radar (SAR) images. The detection head of the baseline model, Sparse R-CNN, is re-designed to enable the model to capture object orientation. We also fine-tune the model on the RSDD-SAR dataset and provide a performance comparison with state-of-the-art models. Experimental results show that Sparse R-CNN OBB achieves outstanding performance, surpassing other models in both inshore and offshore scenarios. The code is available at this http URL.
https://arxiv.org/abs/2409.07973
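A hedged sketch of the core idea of sparse learnable oriented proposals: a fixed set of 300 proposal boxes, extended with an angle parameter, learned directly as model parameters instead of being produced by an anchor-based proposal generator. Names and initialization below are illustrative:

```python
import torch
import torch.nn as nn

class OrientedProposals(nn.Module):
    """Sparse learnable oriented proposals, in the spirit of Sparse R-CNN:
    N boxes (cx, cy, w, h, theta) plus per-proposal feature vectors are
    learned directly and refined by the detection head."""
    def __init__(self, num_proposals=300, feat_dim=256):
        super().__init__()
        # (cx, cy, w, h) start as the whole image in normalized coordinates,
        # theta starts at 0; the detection head iteratively refines them.
        boxes = torch.tensor([[0.5, 0.5, 1.0, 1.0, 0.0]]).repeat(num_proposals, 1)
        self.proposal_boxes = nn.Parameter(boxes)
        self.proposal_feats = nn.Parameter(torch.randn(num_proposals, feat_dim))

    def forward(self, batch_size):
        return (self.proposal_boxes.unsqueeze(0).expand(batch_size, -1, -1),
                self.proposal_feats.unsqueeze(0).expand(batch_size, -1, -1))
```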
Dense-localization Audio-Visual Events (DAVE) aims to identify the time boundaries and corresponding categories of events that can be heard and seen concurrently in an untrimmed video. Existing methods typically encode audio and visual representations separately, without any explicit cross-modal alignment constraint, and then adopt dense cross-modal attention to integrate multimodal information for DAVE. Thus, these methods inevitably aggregate irrelevant noise and events, especially in complex and long videos, leading to imprecise detection. In this paper, we present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE. The core idea is to exploit the local temporal continuity of audio-visual events, which serves as an informative yet free supervision signal to guide the filtering of irrelevant information and to inspire the extraction of complementary multimodal information during both the unimodal and cross-modal learning stages. i) Specifically, LOCO applies Locality-aware Correspondence Correction (LCC) to unimodal features by leveraging cross-modal local-correlated properties without any extra annotations. This forces the unimodal encoders to highlight similar semantics shared by audio and visual features. ii) To better aggregate such audio and visual features, we further customize a Cross-modal Dynamic Perception layer (CDP) in the cross-modal feature pyramid to understand the local temporal patterns of audio-visual events by imposing local consistency within multimodal features in a data-driven manner. By incorporating LCC and CDP, LOCO provides solid performance gains and outperforms existing methods for DAVE. The source code will be released.
https://arxiv.org/abs/2409.07967
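A speculative sketch of how a locality-aware correspondence signal could be expressed as a loss: cosine similarities between audio and visual features at nearby timesteps are pulled up relative to distant ones. This is one plausible reading of LCC, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def local_correspondence_loss(audio, visual, radius=2):
    """audio, visual: (B, T, C) per-timestep features. Encourages
    audio-visual similarity within a local temporal window relative to
    distant timesteps. Assumes T > 2 * radius + 1 so both the local and
    distant sets are non-empty."""
    a = F.normalize(audio, dim=-1)
    v = F.normalize(visual, dim=-1)
    sim = torch.einsum('btc,bsc->bts', a, v)                 # (B, T, T) cosine sims
    T = sim.size(1)
    idx = torch.arange(T)
    local = (idx[:, None] - idx[None, :]).abs() <= radius    # local window mask
    pos = sim[:, local].mean()                               # nearby timesteps
    neg = sim[:, ~local].mean()                              # distant timesteps
    return neg - pos      # minimizing pulls local audio-visual pairs together
```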
Online Grooming (OG) is a prevalent threat predominantly facing children online, with groomers using deceptive methods to prey on the vulnerability of children on social media and messaging platforms. These attacks can have severe psychological and physical impacts, including a tendency towards revictimization. Current technical measures are inadequate, especially with the advent of end-to-end encryption, which hampers message monitoring. Existing solutions focus on the signature analysis of child abuse media, which does not effectively address real-time OG detection. This paper posits that OG attacks are complex, requiring the identification of specific communication patterns between adults and children. It introduces a novel approach leveraging advanced models such as BERT and RoBERTa for Message-Level Analysis and a Context Determination approach for classifying actor interactions, including the introduction of Actor Significance Thresholds and Message Significance Thresholds. The proposed method aims to enhance accuracy and robustness in detecting OG by considering the dynamic and multi-faceted nature of these attacks. Cross-dataset experiments evaluate the robustness and versatility of our approach. This paper's contributions include improved detection methodologies and the potential for application in various scenarios, addressing gaps in current literature and practices.
https://arxiv.org/abs/2409.07958
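A hedged sketch of the two-level scheme using the Hugging Face transformers API. The checkpoint name and the label set are placeholders (no such model is released with the paper), and the threshold values are illustrative:

```python
from transformers import pipeline

# Hypothetical fine-tuned RoBERTa checkpoint for message-level grooming
# risk; the model name below is a placeholder, not a released model.
clf = pipeline("text-classification", model="roberta-grooming-risk")

def actor_risk(messages, msg_threshold=0.8, actor_threshold=0.3):
    """Score each message, keep only confident 'risky' ones (a Message
    Significance Threshold), then flag the actor if the fraction of risky
    messages passes an Actor Significance Threshold. Labels and threshold
    values are illustrative assumptions."""
    scores = clf(messages)
    risky = [s for s in scores
             if s["label"] == "RISK" and s["score"] >= msg_threshold]
    return len(risky) / max(len(messages), 1) >= actor_threshold
```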
In the wake of a fabricated explosion image at the Pentagon, the ability to discern real images from fake counterparts has never been more critical. Our study introduces a novel multi-modal approach to detect AI-generated images amidst the proliferation of new generation methods such as Diffusion models. Our method, UGAD, encompasses three key detection steps: First, we transform the RGB images into YCbCr channels and apply an Integral Radial Operation to emphasize salient radial features. Second, the Spatial Fourier Extraction operation is used for a spatial shift, utilizing a pre-trained deep learning network for optimal feature extraction. Finally, the deep neural network classification stage processes the data through dense layers, using softmax for classification. Our approach significantly enhances the accuracy of differentiating between real and AI-generated images, as evidenced by a 12.64% increase in accuracy and a 28.43% increase in AUC compared to existing state-of-the-art methods.
https://arxiv.org/abs/2409.07913
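A sketch of a plausible front end for the first step: converting RGB to YCbCr and integrating the Fourier magnitude of each channel over concentric radii. This is one reading of the "Integral Radial Operation," not the paper's exact definition:

```python
import numpy as np
from PIL import Image

def radial_profiles(img_path):
    """Convert an RGB image to YCbCr and sum the Fourier magnitude of
    each channel over integer radii around the spectrum center, yielding
    one 1-D radial signature per channel."""
    ycbcr = np.asarray(Image.open(img_path).convert("YCbCr"), dtype=np.float32)
    profiles = []
    for ch in range(3):
        spec = np.abs(np.fft.fftshift(np.fft.fft2(ycbcr[..., ch])))
        h, w = spec.shape
        yy, xx = np.indices(spec.shape)
        r = np.hypot(yy - h / 2, xx - w / 2).astype(int)  # radius of each pixel
        # Sum the spectrum over each integer radius (the radial integral)
        profiles.append(np.bincount(r.ravel(), weights=spec.ravel()))
    return profiles  # one radial signature per Y/Cb/Cr channel
```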
Reducing false positives is essential for enhancing object detector performance, as reflected in the mean Average Precision (mAP) metric. Although object detectors have achieved notable improvements and high mAP scores on the COCO dataset, analysis reveals limited progress in addressing false positives caused by non-target visual clutter: background objects not included in the annotated categories. This issue is particularly critical in real-world applications, such as fire and smoke detection, where minimizing false alarms is crucial. In this study, we introduce COCO-FP, a new evaluation dataset derived from the ImageNet-1K dataset, designed to address this issue. By extending the original COCO validation dataset, COCO-FP specifically assesses object detectors' performance in mitigating background false positives. Our evaluation of both standard and advanced object detectors shows a significant number of false positives in both closed-set and open-set scenarios. For example, the AP50 metric for YOLOv9-E decreases from 72.8 to 65.7 when shifting from COCO to COCO-FP. The dataset is available at this https URL.
https://arxiv.org/abs/2409.07907
For traffic incident detection, the acquisition of data and labels is notably resource-intensive, rendering semi-supervised traffic incident detection both a formidable and consequential challenge. Thus, this paper focuses on traffic incident detection in a semi-supervised manner. It proposes a semi-supervised learning model named FPMT within the framework of MixText. The data augmentation module introduces Generative Adversarial Networks to balance and expand the dataset. During the mix-up process in the hidden space, it employs a probabilistic pseudo-mixing mechanism to enhance regularization and elevate model precision. In terms of training strategy, it initiates with unsupervised training on all data, followed by supervised fine-tuning on a subset of labeled data, ultimately completing the goal of semi-supervised training. Through empirical validation on four authentic datasets, our FPMT model exhibits outstanding performance across various metrics. Particularly noteworthy is its robust performance even in scenarios with low label rates.
https://arxiv.org/abs/2409.07839
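A minimal sketch of a probabilistic pseudo-mixing step in the hidden space, in the spirit of MixText-style training; the mixing probability, Beta parameter, and exact mechanism are assumptions, not FPMT's specification:

```python
import torch

def probabilistic_hidden_mixup(h_a, h_b, y_a, y_b, p_mix=0.5, alpha=0.4):
    """With probability p_mix, interpolate the hidden states (and labels)
    of two examples with a Beta-sampled coefficient; otherwise pass the
    first example through unchanged. h_*: hidden-state tensors, y_*:
    one-hot or soft label tensors."""
    if torch.rand(1).item() > p_mix:
        return h_a, y_a
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1 - lam)              # bias the mix toward the first example
    return lam * h_a + (1 - lam) * h_b, lam * y_a + (1 - lam) * y_b
```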
This study provides a comprehensive analysis of the YOLOv9 object detection model, focusing on its architectural innovations, training methodologies, and performance improvements over its predecessors. Key advancements, such as the Generalized Efficient Layer Aggregation Network (GELAN) and Programmable Gradient Information (PGI), significantly enhance feature extraction and gradient flow, leading to improved accuracy and efficiency. By incorporating depthwise convolutions and the lightweight C3Ghost architecture, YOLOv9 reduces computational complexity while maintaining high precision. Benchmark tests on Microsoft COCO demonstrate its superior mean Average Precision (mAP) and faster inference times, outperforming YOLOv8 across multiple metrics. The model's versatility is highlighted by its seamless deployment across various hardware platforms, from edge devices to high-performance GPUs, with built-in support for PyTorch and TensorRT integration. This paper provides the first in-depth exploration of YOLOv9's internal features and their real-world applicability, establishing it as a state-of-the-art solution for real-time object detection across industries, from IoT devices to large-scale industrial applications.
https://arxiv.org/abs/2409.07813
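To make the complexity claim concrete, here is a small worked comparison of weight counts for a standard 3x3 convolution versus a depthwise separable one (bias terms omitted):

```python
def conv_params(c_in, c_out, k=3):
    """Weight counts for a k x k convolution mapping c_in -> c_out channels."""
    standard = k * k * c_in * c_out                      # full convolution
    depthwise_separable = k * k * c_in + c_in * c_out    # depthwise + 1x1 pointwise
    return standard, depthwise_separable

std, dws = conv_params(256, 256)
print(std, dws, round(std / dws, 1))   # 589824 67840 8.7  -> ~8.7x fewer weights
```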
Wildlife monitoring via camera traps has become an essential tool in ecology, but the deployment of machine learning models for on-device animal classification faces significant challenges due to domain shifts and resource constraints. This paper introduces WildFit, a novel approach that reconciles the conflicting goals of achieving high domain generalization performance and ensuring efficient inference for camera trap applications. WildFit leverages continuous background-aware model fine-tuning to deploy ML models tailored to the current location and time window, allowing it to maintain robust classification accuracy in the new environment without requiring significant computational resources. This is achieved by background-aware data synthesis, which generates training images representing the new domain by blending background images with animal images from the source domain. We further enhance fine-tuning effectiveness through background drift detection and class distribution drift detection, which optimize the quality of synthesized data and improve generalization performance. Our extensive evaluation across multiple camera trap datasets demonstrates that WildFit achieves significant improvements in classification accuracy and computational efficiency compared to traditional approaches.
https://arxiv.org/abs/2409.07796
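A hedged sketch of background-aware data synthesis as described above: an animal crop from the source domain is alpha-blended into a target-domain background using its segmentation mask. WildFit's actual blending may be more sophisticated (e.g., matching lighting or using Poisson blending):

```python
import numpy as np

def synthesize(background, animal, mask, top_left):
    """Blend a source-domain animal crop into a target-domain background.

    background: (H, W, 3) uint8; animal: (h, w, 3) uint8;
    mask: (h, w) float in [0, 1]; top_left: (y, x) paste position.
    """
    out = background.astype(np.float32).copy()
    y, x = top_left
    h, w = animal.shape[:2]
    region = out[y:y + h, x:x + w]
    m = mask[..., None]                          # broadcast mask over RGB
    out[y:y + h, x:x + w] = m * animal + (1 - m) * region
    return out.astype(np.uint8)
```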
Effective human-robot collaboration (HRC) requires robots to possess human-like intelligence. Inspired by the human cognitive ability to selectively process and filter elements in complex environments, this paper introduces a novel concept and scene-understanding approach termed `relevance,' which identifies the relevant components in a scene. To accurately and efficiently quantify relevance, we developed an event-based framework that selectively triggers relevance determination, along with a probabilistic methodology built on a structured scene representation. Simulation results demonstrate that the relevance framework and methodology accurately predict the relevance of a general HRC setup, achieving a precision of 0.99 and a recall of 0.94. Relevance can be broadly applied to several areas in HRC: it improves task planning time by 79.56% compared with pure planning for a cereal task, reduces perception latency by up to 26.53% for an object detector, improves HRC safety by up to 13.50%, and reduces the number of inquiries for HRC by 75.36%. A real-world demonstration showcases the relevance framework's ability to intelligently assist humans in everyday tasks.
https://arxiv.org/abs/2409.07753
Image operation chain detection techniques have gained increasing attention recently in the field of multimedia forensics. However, existing detection methods suffer from the generalization problem. Moreover, the channel correlation of color images, which provides additional forensic evidence, is often ignored. To solve these issues, in this article we propose a novel two-stream multi-channel fusion network for color image operation chain detection, in which the spatial artifact stream and the noise residual stream are explored in a complementary manner. Specifically, we first propose a novel deep residual architecture without pooling in the spatial artifact stream to learn the global feature representation of multi-channel correlation. Then, a set of filters is designed to aggregate the correlation information of multiple channels while capturing the low-level features in the noise residual stream. Subsequently, the high-level features are extracted by the deep residual model. Finally, features from the two streams are fed into a fusion module to effectively learn richer discriminative representations of the operation chain. Extensive experiments show that the proposed method achieves state-of-the-art generalization ability while maintaining robustness to JPEG compression. The source code used in these experiments will be released at this https URL.
https://arxiv.org/abs/2409.07701
Autonomous robots use simultaneous localization and mapping (SLAM) for efficient and safe navigation in various environments. LiDAR sensors are integral to these systems for object identification and localization. However, LiDAR systems, though effective in detecting solid objects (e.g., trash bins, bottles, etc.), encounter limitations in identifying semitransparent or non-tangible objects (e.g., fire, smoke, steam, etc.) due to their poor reflecting characteristics. Additionally, LiDAR fails to detect features such as navigation signs and often struggles to detect certain hazardous materials that lack a distinct surface for effective laser reflection. In this paper, we propose a highly accurate stereo-vision approach to complement LiDAR in autonomous robots. The system employs advanced stereo-vision-based object detection to detect both tangible and non-tangible objects and then uses simple machine learning to precisely estimate the depth and size of the object. The depth and size information is then integrated into the SLAM process to enhance the robot's navigation capabilities in complex environments. Our evaluation, conducted on an autonomous robot equipped with LiDAR and stereo-vision systems, demonstrates high accuracy in the estimation of an object's depth and size. A video illustration of the proposed scheme is available at: \url{this https URL}.
https://arxiv.org/abs/2409.07623
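The paper does not spell out its depth estimator, but the underlying stereo geometry is standard: depth follows Z = f * B / d from focal length f (pixels), baseline B (meters), and disparity d (pixels). A minimal sketch with illustrative numbers:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Standard pinhole stereo relation: Z = f * B / d.

    disparity_px: disparity in pixels, focal_px: focal length in pixels,
    baseline_m: distance between the two cameras in meters.
    """
    return focal_px * baseline_m / disparity_px

# e.g. f = 700 px, B = 0.12 m, d = 20 px  ->  Z = 4.2 m
print(depth_from_disparity(20, 700, 0.12))
```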
The dissemination of Large Language Models (LLMs), trained at scale and endowed with powerful text-generating abilities, has vastly increased the threats posed by generative AI technologies by reducing the cost of producing harmful, toxic, fake, or forged content. In response, various proposals have been made to automatically discriminate artificially generated from human-written texts, typically framing the problem as a classification problem. Most approaches evaluate an input document with a well-chosen detector LLM, assuming that low-perplexity scores reliably signal machine-made content. As relying on a single detector can induce brittleness of performance, we instead consider several and derive a new, theoretically grounded approach to combining their respective strengths. Our experiments, using a variety of generator LLMs, suggest that our method effectively increases the robustness of detection.
https://arxiv.org/abs/2409.07615
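A simplified sketch of the multi-detector idea: standardize each detector's log-perplexity against statistics estimated on human-written text and average the z-scores. The paper derives a theoretically grounded combination rule, so plain averaging here is only a stand-in:

```python
import numpy as np

def combined_score(perplexities, mu, sigma):
    """perplexities: one perplexity per detector LLM for the same document;
    mu, sigma: per-detector mean and std of log-perplexity estimated on
    human-written reference text. Lower combined scores suggest
    machine-generated text; the decision threshold must be calibrated."""
    z = (np.log(np.asarray(perplexities)) - np.asarray(mu)) / np.asarray(sigma)
    return z.mean()
```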
Abnormal behavior detection, action recognition, and fight and violence detection in videos constitute an area that has attracted a lot of interest in recent years. In this work, we propose an architecture that combines a Bidirectional Gated Recurrent Unit (BiGRU) and a 2D Convolutional Neural Network (CNN) to detect violence in video sequences. The CNN extracts spatial characteristics from each frame, while the BiGRU extracts temporal and local motion characteristics using the CNN-extracted features from multiple frames. The proposed end-to-end deep learning network is tested on three public datasets with varying scene complexities, achieving accuracies of up to 98%. The obtained results are promising and demonstrate the performance of the proposed end-to-end approach.
https://arxiv.org/abs/2409.07588
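A hedged PyTorch sketch of the per-frame CNN + BiGRU pipeline described above; the MobileNetV2 backbone, hidden size, and read-out from the last timestep are illustrative choices, not the paper's configuration:

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class CNNBiGRU(nn.Module):
    """A 2D CNN embeds each frame, a bidirectional GRU models the frame
    sequence, and a linear head predicts violence / non-violence."""
    def __init__(self, hidden=256):
        super().__init__()
        self.cnn = mobilenet_v2(weights=None).features   # per-frame encoder
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gru = nn.GRU(1280, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, clips):                 # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.pool(self.cnn(clips.flatten(0, 1))).flatten(1)  # (B*T, 1280)
        seq, _ = self.gru(feats.view(b, t, -1))
        return self.head(seq[:, -1])          # classify from the final state
```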
Violence and abnormal behavior detection research has seen increasing interest in recent years, due mainly to a rise in crime in large cities worldwide. In this work, we propose a deep learning architecture for violence detection which combines both recurrent neural networks (RNNs) and 2-dimensional convolutional neural networks (2D CNNs). In addition to video frames, we use optical flow computed from the captured sequences. The CNN extracts spatial characteristics in each frame, while the RNN extracts temporal characteristics. The use of optical flow allows the movements in the scenes to be encoded. The proposed approaches reach the same level as state-of-the-art techniques and sometimes surpass them. They were validated on three databases, achieving good results.
https://arxiv.org/abs/2409.07581
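A minimal sketch of computing the dense optical-flow input with OpenCV's Farneback method, which yields a per-pixel displacement field encoding scene motion; the paper does not state which flow algorithm it uses, so this choice is an assumption:

```python
import cv2

def farneback_flow(prev_bgr, next_bgr):
    """Dense Farneback optical flow between two consecutive BGR frames,
    returning an (H, W, 2) displacement field (dx, dy per pixel)."""
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    nxt = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
```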
Transformers demonstrate competitive performance in terms of precision on the problem of vision-based object detection. However, they require considerable computational resources due to the quadratic size of the attention weights. In this work, we propose to cluster the transformer input on the basis of its entropy. The reason is that the self-information of each pixel (whose sum is the entropy) is likely to be similar among pixels corresponding to the same objects. Clustering reduces the size of the data given as input to the transformer and therefore reduces training time and GPU memory usage, while preserving meaningful information to be passed through the remaining parts of the network. The proposed process is organized in a module called ENACT, which can be plugged into any transformer architecture whose encoder contains a multi-head self-attention computation. We ran extensive experiments using the COCO object detection dataset and three detection transformers. The obtained results demonstrate that, in all tested cases, there is a consistent reduction in the required computational resources, while the precision of the detection task is only slightly reduced. The code of the ENACT module will become available at this https URL
https://arxiv.org/abs/2409.07541
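A speculative sketch of entropy-based input reduction: score each pixel token by the entropy of a softmax over its channels, group tokens with similar scores, and average within groups to shrink the sequence seen by self-attention. ENACT's actual clustering is more involved; this only shows the general idea:

```python
import torch
import torch.nn.functional as F

def entropy_cluster(tokens, n_clusters=64):
    """tokens: (B, N, C) flattened pixel tokens -> (B, n_clusters, C).
    Assumes N is divisible by n_clusters."""
    p = F.softmax(tokens, dim=-1)
    info = -(p * torch.log(p + 1e-9)).sum(-1)   # per-token entropy, (B, N)
    order = info.argsort(dim=1)                 # group similar-entropy tokens
    sorted_tokens = torch.gather(tokens, 1, order[..., None].expand_as(tokens))
    b, n, c = sorted_tokens.shape
    # Average equal-size groups of similar-entropy tokens
    return sorted_tokens.view(b, n_clusters, n // n_clusters, c).mean(2)
```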
This paper proposes a novel Perturb-ability Score (PS) that can be used to identify Network Intrusion Detection System (NIDS) features that can be easily manipulated by attackers in the problem-space. We demonstrate that using PS to select only non-perturb-able features for ML-based NIDS maintains detection performance while enhancing robustness against adversarial attacks.
https://arxiv.org/abs/2409.07448
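A minimal sketch of how PS-based feature selection could be applied downstream, assuming a PS value per feature has already been computed (the scoring itself is the paper's contribution and is not reproduced here); the threshold is illustrative:

```python
import numpy as np

def select_robust_features(X, ps_scores, ps_threshold=0.5):
    """Keep only NIDS features whose Perturb-ability Score is below the
    threshold (higher PS = easier for an attacker to manipulate in the
    problem-space), then train the ML-based NIDS on the reduced matrix.

    X: (n_samples, n_features) feature matrix; ps_scores: one PS per feature.
    """
    keep = np.where(np.asarray(ps_scores) < ps_threshold)[0]
    return X[:, keep], keep
```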