Camouflaged Object Detection (COD), the task of identifying objects concealed within their environments, has seen rapid growth due to its wide range of practical applications. A key step toward developing trustworthy COD systems is the estimation and effective utilization of uncertainty. In this work, we propose a human-machine collaboration framework for classifying the presence of camouflaged objects, leveraging the complementary strengths of computer vision (CV) models and noninvasive brain-computer interfaces (BCIs). Our approach introduces a multiview backbone to estimate uncertainty in CV model predictions, utilizes this uncertainty during training to improve efficiency, and defers low-confidence cases to human evaluation via RSVP-based BCIs during testing for more reliable decision-making. We evaluated the framework on the CAMO dataset, achieving state-of-the-art results with an average improvement of 4.56\% in balanced accuracy (BA) and 3.66\% in the F1 score compared to existing methods. For the best-performing participants, the improvements reached 7.6\% in BA and 6.66\% in the F1 score. Analysis of the training process revealed a strong correlation between our confidence measures and precision, while an ablation study confirmed the effectiveness of the proposed training policy and the human-machine collaboration strategy. Overall, this work reduces human cognitive load, improves system reliability, and provides a strong foundation for advancements in real-world COD applications and human-computer interaction. Our code and data are available at: this https URL.
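As a concrete illustration of the deferral step, the sketch below routes test cases by the CV model's confidence; the 0.85 threshold and the binary-softmax setup are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def route_predictions(probs, threshold=0.85):
    """Split CV-model outputs into auto-decided and human-deferred sets.

    probs: (N, 2) softmax outputs for the binary "camouflaged object
    present / absent" decision. threshold=0.85 is illustrative only.
    """
    confidence = probs.max(axis=1)                    # top-class probability
    auto_idx = np.where(confidence >= threshold)[0]   # decided by the CV model
    defer_idx = np.where(confidence < threshold)[0]   # sent to RSVP-BCI review
    return auto_idx, defer_idx

probs = np.array([[0.97, 0.03], [0.55, 0.45], [0.05, 0.95]])
auto_idx, defer_idx = route_predictions(probs)        # -> [0 2], [1]
```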
https://arxiv.org/abs/2502.08373
Automatic monitoring of tree plantations plays a crucial role in agriculture. Reliable monitoring of tree health helps farmers make informed management decisions and take appropriate action. Using drone images for automatic plantation monitoring can enhance the accuracy of the monitoring process while remaining affordable to small farmers in developing countries such as India. Small, low-cost drones equipped with an RGB camera can capture high-resolution images of agricultural fields, allowing for detailed analysis of the well-being of the plantations. Existing methods of automated plantation monitoring are mostly based on satellite images, which are difficult for farmers to obtain. We propose an automated system for plantation health monitoring using drone images, which are becoming easier for farmers to acquire. We introduce a dataset of tree images with three categories: ``Good health", ``Stunted", and ``Dead", annotated with the CVAT annotation tool for research purposes. We experiment with several well-known CNN models to observe their performance on the proposed dataset. The initially low accuracy levels show the complexity of the proposed dataset. Our study further revealed that a depth-wise convolution operation embedded in a deep CNN model can enhance the model's performance on the drone dataset. Finally, we apply state-of-the-art object detection models to identify individual trees for better automatic monitoring.
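The depth-wise convolution the study highlights can be sketched as a standard depthwise-separable block in PyTorch; the channel sizes and layer layout here are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise + pointwise convolution, the operation found helpful
    on drone imagery; channel counts here are illustrative."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # one 3x3 filter per channel (groups=in_ch), then a 1x1 mixing conv
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 32, 128, 128)        # e.g. a 128x128 crop of one tree
y = DepthwiseSeparableConv(32, 64)(x)   # -> (1, 64, 128, 128)
```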
https://arxiv.org/abs/2502.08233
The growing demand for efficient semantic communication systems capable of managing diverse tasks and adapting to fluctuating channel conditions has driven the development of robust, resource-efficient frameworks. This article introduces a novel channel-adaptive and multi-task-aware semantic communication framework based on a masked auto-encoder architecture. Our framework optimizes the transmission of meaningful information by incorporating a multi-task-aware scoring mechanism that identifies and prioritizes semantically significant data across multiple concurrent tasks. A channel-aware extractor is employed to dynamically select relevant information in response to real-time channel conditions. By jointly optimizing semantic relevance and transmission efficiency, the framework ensures minimal performance degradation under resource constraints. Experimental results demonstrate the superior performance of our framework compared to conventional methods in tasks such as image reconstruction and object detection. These results underscore the framework's adaptability to heterogeneous channel environments and its scalability for multi-task applications, positioning it as a promising solution for next-generation semantic communication networks.
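A minimal sketch of the scoring-and-selection idea follows, assuming a top-k token selection and a linear SNR-to-keep-ratio mapping; both are illustrative stand-ins, since the paper's scoring mechanism and channel-aware extractor are learned components.

```python
import torch

def select_tokens(tokens, task_scores, snr_db, max_ratio=0.75, min_ratio=0.25):
    """Hypothetical sketch of multi-task-aware scoring + channel-aware selection.

    tokens:      (N, D) patch embeddings from the masked auto-encoder
    task_scores: (T, N) per-task semantic relevance, one row per task
    snr_db:      current channel SNR; better channels keep more tokens
    """
    relevance = task_scores.sum(dim=0)                 # aggregate across tasks
    # map SNR in [0, 20] dB to a keep ratio in [min_ratio, max_ratio]
    ratio = min_ratio + (max_ratio - min_ratio) * min(max(snr_db, 0.0), 20.0) / 20.0
    k = max(1, int(ratio * tokens.shape[0]))
    keep = torch.topk(relevance, k).indices            # transmit only these
    return tokens[keep], keep

tokens = torch.randn(196, 768)
task_scores = torch.rand(2, 196)          # e.g. reconstruction + detection
kept, idx = select_tokens(tokens, task_scores, snr_db=8.0)
```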
https://arxiv.org/abs/2502.08221
Deepfake videos are causing growing concern among communities due to their ever-increasing realism. Naturally, automated detection of forged Deepfake videos is attracting a proportional amount of interest from researchers. Current methods for detecting forged videos mainly rely on global frame features and under-utilize the spatio-temporal inconsistencies found in the manipulated videos. Moreover, they fail to attend to manipulation-specific, subtle, and well-localized pattern variations along both spatial and temporal dimensions. Addressing these gaps, we propose a neural Deepfake detector that focuses on the localized manipulative signatures of the forged videos at the individual frame level as well as the frame sequence level. Using a ResNet backbone, it strengthens shallow frame-level feature learning with a spatial attention mechanism. The spatial stream of the model is further helped by fusing texture-enhanced shallow features with the deeper features. Simultaneously, the model processes frame sequences with a distance attention mechanism that further allows fusion of temporal attention maps with the learned features at the deeper layers. The overall model is trained as a classifier to detect forged content. We evaluate our method on two popular large data sets and achieve significant performance gains over the state-of-the-art methods. Moreover, our technique also provides memory and computational advantages over the competitive techniques.
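The spatial attention on shallow features could look like the CBAM-style module below; this is a stand-in sketch, since the abstract does not specify the exact attention design.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention over shallow frame features
    (an assumed design; the paper's module may differ)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                       # x: (B, C, H, W) shallow features
        avg_pool = x.mean(dim=1, keepdim=True)  # (B, 1, H, W) channel average
        max_pool = x.amax(dim=1, keepdim=True)  # (B, 1, H, W) channel maximum
        attn = torch.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * attn                         # reweight frame-level features

feats = torch.randn(2, 64, 56, 56)              # e.g. shallow ResNet features
out = SpatialAttention()(feats)
```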
https://arxiv.org/abs/2502.08216
In the field of synthetic aperture radar (SAR) remote sensing image interpretation, although vision-language models (VLMs) have made remarkable progress in natural language processing and image understanding, their applications remain limited in professional domains due to insufficient domain expertise. This paper proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs and encompasses diverse scenarios with detailed target annotations. The dataset not only supports key tasks such as visual understanding and object detection, but also has a unique innovative aspect: this study develops a visual-language dataset and benchmark for the SAR domain, enabling and evaluating VLMs' capabilities in SAR image interpretation and providing a paradigmatic framework for constructing multimodal datasets across various remote sensing vertical domains. Experiments on 16 mainstream VLMs fully verify the effectiveness of the dataset and establish the first multi-task dialogue benchmark in the SAR field. The project will be released at this https URL, aiming to promote the in-depth development and wide application of SAR visual language models.
https://arxiv.org/abs/2502.08168
We introduce \textbf{Knowledge Swapping}, a novel task designed to selectively regulate knowledge of a pretrained model by enabling the forgetting of user-specified information, retaining essential knowledge, and acquiring new knowledge simultaneously. By delving into the analysis of the knock-on feature hierarchy, we find that incremental learning typically progresses from low-level representations to higher-level semantics, whereas forgetting tends to occur in the opposite direction, starting from high-level semantics and moving down to low-level features. Building upon this, we propose to benchmark the knowledge swapping task with the strategy of \textit{Learning Before Forgetting}. Comprehensive experiments on various tasks like image classification, object detection, and semantic segmentation validate the effectiveness of the proposed strategy. The source code is available at \href{this https URL}{this https URL}.
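A hedged sketch of the Learning Before Forgetting ordering follows: the model first acquires the new knowledge, then forgets, with a retain set anchoring essential knowledge. The entropy-maximization forgetting objective is an illustrative choice, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def learning_before_forgetting(model, opt, learn_loader, forget_loader,
                               retain_loader, epochs=1, alpha=1.0):
    """Sketch of the two-phase ordering; objectives are assumptions."""
    for _ in range(epochs):                       # phase 1: learn new knowledge
        for x, y in learn_loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
    for _ in range(epochs):                       # phase 2: forget + retain
        for (xf, _), (xr, yr) in zip(forget_loader, retain_loader):
            opt.zero_grad()
            logp = F.log_softmax(model(xf), dim=1)
            entropy = -(logp.exp() * logp).sum(dim=1).mean()
            # keep the retain set correct while pushing forget-set outputs
            # toward uniform (high entropy); alpha balances the two terms
            loss = F.cross_entropy(model(xr), yr) - alpha * entropy
            loss.backward()
            opt.step()
```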
https://arxiv.org/abs/2502.08075
Detecting AI-generated images is a challenging yet essential task. A primary difficulty arises from the detector's tendency to rely on spurious patterns, such as compression artifacts, which can influence its decisions. These issues often stem from specific patterns that the detector associates with the real data distribution, making it difficult to isolate the actual generative traces. We argue that an image should be classified as fake if and only if it contains artifacts introduced by the generative model. Based on this premise, we propose Stay Positive, an algorithm designed to constrain the detector's focus to generative artifacts while disregarding those associated with real data. Experimental results demonstrate that detectors trained with Stay Positive exhibit reduced susceptibility to spurious correlations, leading to improved generalization and robustness to post-processing. Additionally, unlike detectors that associate artifacts with real images, those that focus purely on fake artifacts are better at detecting inpainted real images.
https://arxiv.org/abs/2502.07778
In multimedia applications such as films and video games, spatial audio techniques are widely employed to enhance user experiences by simulating 3D sound: transforming mono audio into binaural formats. However, this process is often complex and labor-intensive for sound designers, requiring precise synchronization of audio with the spatial positions of visual components. To address these challenges, we propose a visual-based spatial audio generation system: an automated pipeline that integrates YOLOv8-based face detection, monocular depth estimation, and spatial audio techniques. Notably, the system operates without requiring additional binaural dataset training. The proposed system is evaluated against existing spatial audio generation systems using objective metrics. Experimental results demonstrate that our method significantly improves spatial consistency between audio and video, enhances speech quality, and performs robustly in multi-speaker scenarios. By streamlining the audio-visual alignment process, the proposed system enables sound engineers to achieve high-quality results efficiently, making it a valuable tool for professionals in multimedia production.
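The geometry of such a pipeline can be sketched as follows: the detector's box position gives an azimuth, the monocular depth gives a distance-based gain, and a toy constant-power pan plus interaural delay stands in for the full binaural (HRTF) rendering the real system would use.

```python
import numpy as np

def spatialize(mono, bbox_cx, frame_w, depth_m, sr=16000):
    """Toy stereo rendering from detector outputs; a minimal sketch only.

    mono:    1-D waveform for one speaker
    bbox_cx: horizontal center of the detected face in pixels
    depth_m: estimated distance from the monocular depth model
    """
    az = (bbox_cx / frame_w - 0.5) * np.pi        # map [0, W] -> [-pi/2, pi/2]
    pan = (az / (np.pi / 2) + 1.0) / 2.0          # 0 = hard left, 1 = hard right
    gain = 1.0 / max(depth_m, 1.0)                # farther sources are quieter
    # constant-power panning
    left = mono * gain * np.cos(pan * np.pi / 2)
    right = mono * gain * np.sin(pan * np.pi / 2)
    # crude interaural time difference: delay the far ear (up to ~0.6 ms)
    itd = int(abs(pan - 0.5) * 2 * 0.0006 * sr)
    if pan > 0.5:
        left = np.pad(left, (itd, 0))[: len(mono)]
    elif pan < 0.5:
        right = np.pad(right, (itd, 0))[: len(mono)]
    return np.stack([left, right], axis=0)

stereo = spatialize(np.random.randn(16000), bbox_cx=900, frame_w=1280, depth_m=2.5)
```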
https://arxiv.org/abs/2502.07538
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that systematically distorts benchmark prompts via a parametric transformation and detects overfitting of LLMs. By rephrasing inputs while preserving their semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns. Evaluated on the MMLU benchmark using 26 leading LLMs, our method reveals an average performance degradation of 2.15% under modest perturbations, with 20 of the 26 models exhibiting statistically significant differences. Notably, models with higher baseline accuracy exhibit larger performance differences under perturbation, and larger LLMs tend to be more sensitive to rephrasings, indicating that both may over-rely on fixed prompt patterns. In contrast, the Llama family and models with lower baseline accuracy show insignificant degradation, suggesting reduced dependency on superficial cues. Moreover, C-BOD's dataset- and model-agnostic design allows easy integration into training pipelines to promote more robust language understanding. Our findings challenge the community to look beyond leaderboard scores and prioritize resilience and generalization in LLM evaluation.
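One plausible way to operationalize the significance check: score each item before and after rephrasing and run an exact McNemar test on the discordant pairs. This is a reading of the protocol for illustration, not necessarily the paper's exact statistic.

```python
import numpy as np
from scipy.stats import binomtest

def cbod_style_check(correct_orig, correct_rephrased):
    """Compare per-item correctness before/after rephrasing (exact McNemar).

    correct_orig, correct_rephrased: boolean arrays over the same items.
    A significant excess of orig-only correct answers suggests the model
    leaned on surface cues of the original prompts.
    """
    b = int(np.sum(correct_orig & ~correct_rephrased))   # right -> wrong
    c = int(np.sum(~correct_orig & correct_rephrased))   # wrong -> right
    p = binomtest(b, b + c, 0.5).pvalue if b + c > 0 else 1.0
    degradation = correct_orig.mean() - correct_rephrased.mean()
    return degradation, p

rng = np.random.default_rng(0)
orig = rng.random(1000) < 0.80
reph = orig & (rng.random(1000) < 0.97)    # a model that loses some items
print(cbod_style_check(orig, reph))
```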
https://arxiv.org/abs/2502.07445
The perception system plays a critical role in ensuring the safety of an autonomous driving system. Driving scene perception is fundamentally an object detection task that requires a balance between accuracy and processing speed. Many contemporary methods focus on improving detection accuracy but often overlook the importance of real-time detection capability when computational resources are limited. Thus, it is vital to investigate efficient object detection strategies for driving scenes. This paper introduces Fast-COS, a novel single-stage object detection framework crafted specifically for driving scene applications. The research begins with an analysis of the backbone, considering both macro and micro architectural designs, yielding the Reparameterized Attention Vision Transformer (RAViT). RAViT utilizes Reparameterized Multi-Scale Depth-Wise Convolution (RepMSDW) and Reparameterized Self-Attention (RepSA) to enhance computational efficiency and feature extraction. In extensive tests across GPU, edge, and mobile platforms, RAViT achieves 81.4% Top-1 accuracy on the ImageNet-1K dataset, demonstrating significant throughput improvements over comparable backbone models such as ResNet, FastViT, RepViT, and EfficientFormer. Additionally, integrating RepMSDW into a feature pyramid network forms RepFPN, enabling fast multi-scale feature fusion. Fast-COS enhances object detection in driving scenes, attaining an AP50 score of 57.2% on the BDD100K dataset and 80.0% on the TJU-DHD Traffic dataset. It surpasses leading models in efficiency, delivering up to 75.9% faster GPU inference and 1.38× higher throughput on edge devices compared to FCOS, YOLOF, and RetinaNet. These findings establish Fast-COS as a highly scalable and reliable solution for real-time applications, especially in resource-limited environments such as autonomous driving systems.
https://arxiv.org/abs/2502.07417
Salient object detection (SOD) plays a critical role in vision-driven measurement systems (VMS), facilitating the detection and segmentation of key visual elements in an image. However, adverse imaging conditions such as haze during the day, low light, and haze at night severely degrade image quality, complicating the SOD process. To address these challenges, we propose a multi-task-oriented nighttime haze imaging enhancer (MToIE), which integrates three tasks: daytime dehazing, low-light enhancement, and nighttime dehazing. MToIE incorporates two key innovative components. First, the network employs a task-oriented node learning mechanism to handle three specific degradation types: daytime haze, low light, and nighttime haze, with an embedded self-attention module enhancing its performance in nighttime imaging. Second, a multi-receptive-field enhancement module efficiently extracts multi-scale features through three parallel depthwise separable convolution branches with different dilation rates, capturing comprehensive spatial information with minimal computational overhead. To ensure optimal image reconstruction quality and visual characteristics, we propose a hybrid loss function. Extensive experiments under different weather and imaging conditions illustrate that MToIE surpasses existing methods, significantly enhancing the accuracy and reliability of vision systems across diverse imaging scenarios. The code is available at this https URL.
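The multi-receptive-field module is concrete enough to sketch: three parallel depthwise-separable 3x3 branches whose dilation rates differ, fused by a pointwise convolution. The rates (1, 2, 3) and channel handling below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiReceptiveField(nn.Module):
    """Three parallel depthwise-separable branches with different dilation
    rates, following the abstract; rates (1, 2, 3) are illustrative."""
    def __init__(self, ch, rates=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # dilated depthwise 3x3 (padding=r keeps spatial size)
                nn.Conv2d(ch, ch, 3, padding=r, dilation=r, groups=ch, bias=False),
                nn.Conv2d(ch, ch, 1, bias=False),   # pointwise mixing
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
            ) for r in rates
        ])
        self.fuse = nn.Conv2d(ch * len(rates), ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

y = MultiReceptiveField(32)(torch.randn(1, 32, 64, 64))  # -> (1, 32, 64, 64)
```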
https://arxiv.org/abs/2502.07351
In deepfake detection, it is essential to maintain high performance by adjusting the parameters of the detector as new deepfake methods emerge. In this paper, we propose a method to automatically and actively select the small amount of additional data required for the continuous training of deepfake detection models in situations where deepfake detection models are regularly updated. The proposed method automatically selects new training data from a \textit{redundant} pool set containing a large number of images generated by new deepfake methods and real images, using the confidence score of the deepfake detection model as a metric. Experimental results show that the deepfake detection model, continuously trained with a small amount of additional data automatically selected and added to the original training set, significantly and efficiently improved the detection performance, achieving an EER of 2.5% with only 15% of the amount of data in the pool set.
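A minimal sketch of the confidence-driven selection, assuming "least confident" means detector scores nearest the decision boundary; the paper's exact criterion may differ.

```python
import numpy as np

def select_from_pool(pool_confidences, budget_ratio=0.15):
    """Pick the pool images the current detector is least sure about.

    pool_confidences: (N,) detector scores in [0, 1] for pool images.
    Scores closest to 0.5 are treated as most ambiguous (an assumed rule).
    """
    n_select = int(budget_ratio * len(pool_confidences))
    uncertainty = -np.abs(pool_confidences - 0.5)   # higher = less confident
    return np.argsort(uncertainty)[-n_select:]      # indices to label and add

scores = np.random.rand(10000)
chosen = select_from_pool(scores)                   # the 1500 most ambiguous images
```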
https://arxiv.org/abs/2502.07269
Recent years have seen an increase in the use of gigapixel-level image and video capture systems and benchmarks with high-resolution wide (HRW) shots. However, unlike close-up shots in the MS COCO dataset, the higher resolution and wider field of view raise unique challenges, such as extreme sparsity and huge scale changes, rendering existing close-up detectors inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the gap in object detection between close-up and HRW shots. The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it can jointly explore global and local attention by fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel cross-slice non-maximum suppression (C-NMS) algorithm to precisely localize objects from noisy windows, and from a simple yet effective multi-scale strategy to improve accuracy. Extensive experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that the proposed SparseFormer significantly improves detection accuracy (by up to 5.8%) and speed (by up to 3x) over state-of-the-art approaches.
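The cross-slice step can be approximated as below: per-window boxes are shifted into global image coordinates and a joint NMS pass suppresses duplicates along slice boundaries. The actual C-NMS is more sophisticated; this shows only the baseline idea.

```python
import torch
from torchvision.ops import nms

def cross_slice_nms(dets_per_slice, offsets, iou_thr=0.5):
    """Merge per-window detections, then run one global NMS pass.

    dets_per_slice: list of (boxes (M, 4), scores (M,)) in slice coordinates
    offsets:        list of (x0, y0) top-left corners for each slice
    """
    all_boxes, all_scores = [], []
    for (boxes, scores), (x0, y0) in zip(dets_per_slice, offsets):
        shift = torch.tensor([x0, y0, x0, y0], dtype=boxes.dtype)
        all_boxes.append(boxes + shift)      # to global coordinates
        all_scores.append(scores)
    boxes = torch.cat(all_boxes)
    scores = torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thr)       # suppress duplicates across slices
    return boxes[keep], scores[keep]

boxes = torch.tensor([[10., 10., 50., 50.]])
scores = torch.tensor([0.9])
dets = [(boxes, scores), (boxes, scores)]    # same object seen in two slices
out_boxes, out_scores = cross_slice_nms(dets, offsets=[(0, 0), (5, 5)])
```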
https://arxiv.org/abs/2502.07216
Dense object detection is widely used in automatic driving, video surveillance, and other fields. This paper focuses on the challenging task of dense object detection. Currently, detection methods based on greedy algorithms, such as non-maximum suppression (NMS), often produce many repetitive predictions or missed detections in dense scenarios, a common problem for NMS-based algorithms. Through the end-to-end DETR (DEtection TRansformer), a type of detector that incorporates post-processing capabilities such as NMS-style de-duplication into the network, we found that homogeneous queries in query-based detectors reduce both the de-duplication capability of the network and the learning efficiency of the encoder, resulting in duplicate predictions and missed detections. To solve this problem, we propose learnable differentiated encoding to de-homogenize the queries; queries can then communicate with each other via differentiated encoding information, replacing the previous self-attention among queries. In addition, we use a joint loss on the encoder output that considers both location and confidence prediction to provide higher-quality initialization for queries. Without cumbersome decoder stacking, and while guaranteeing accuracy, our proposed end-to-end detection framework is more concise and reduces the number of parameters by about 8% compared to Deformable DETR. Our method achieved excellent results on the challenging CrowdHuman dataset, with 93.6% average precision (AP), 39.2% MR-2, and 84.3% JI, outperforming previous SOTA methods such as Iter-E2EDet (Progressive End-to-End Object Detection) and MIP (One proposal, Multiple predictions). In addition, our method is more robust in various scenarios with different densities.
https://arxiv.org/abs/2502.07194
The safe operation of high-voltage transmission lines ensures the security of the power grid. Various foreign objects attached to transmission lines, such as balloons, kites, and nesting birds, can significantly affect the safe and stable operation of high-voltage transmission lines. With the advancement of computer vision technology, periodic automatic inspection of foreign objects is efficient and necessary. Existing detection methods have low accuracy because foreign objects attached to transmission lines are complex, involving occlusions, diverse object types, significant scale variations, and complex backgrounds. In response to the practical needs of the Yunnan Branch of China Southern Power Grid Co., Ltd., this paper proposes an improved YOLOv8m-based model for detecting foreign objects on transmission lines. Experiments are conducted on a dataset collected from Yunnan Power Grid. The proposed model enhances the original YOLOv8m by incorporating a Global Attention Module (GAM) into the backbone to focus on occluded foreign objects, replacing the SPPF module with the SPPCSPC module to augment the model's multiscale feature extraction capability, and introducing the Focal-EIoU loss function to address the imbalance between high- and low-quality samples. These improvements accelerate model convergence and enhance detection accuracy. The experimental results demonstrate that our proposed model achieves a 2.7% increase in mAP_0.5, a 4% increase in mAP_0.5:0.95, and a 6% increase in recall.
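The Focal-EIoU loss adopted here follows its published definition (Zhang et al., 2021): an EIoU term penalizing IoU, center distance, and width/height gaps, reweighted by IoU^gamma so high-quality boxes contribute more. A PyTorch sketch, with gamma and the detached weight as assumed details:

```python
import torch

def focal_eiou_loss(pred, target, gamma=0.5, eps=1e-7):
    """Focal-EIoU for (x1, y1, x2, y2) boxes, written from the published
    definition; gamma=0.5 and the detach are illustrative choices."""
    # intersection / union
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # smallest enclosing box
    enc = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
    cw = enc[:, 0].clamp(min=eps)                 # enclosing width
    ch = enc[:, 1].clamp(min=eps)                 # enclosing height

    # center, width, and height gaps
    ctr_p = (pred[:, :2] + pred[:, 2:]) / 2
    ctr_t = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((ctr_p - ctr_t) ** 2).sum(dim=1)
    dw = (pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])
    dh = (pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])

    eiou = 1 - iou + rho2 / (cw**2 + ch**2) + dw**2 / cw**2 + dh**2 / ch**2
    # focal reweighting by IoU (weight detached so it only rescales gradients)
    return (iou.detach() ** gamma * eiou).mean()
```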
https://arxiv.org/abs/2502.07175
Transformers have become foundational for visual tasks such as object detection, semantic segmentation, and video understanding, but their quadratic complexity in attention mechanisms presents scalability challenges. To address these limitations, the Mamba architecture utilizes state-space models (SSMs) for linear scalability, efficient processing, and improved contextual awareness. This paper investigates Mamba architecture for visual domain applications and its recent advancements, including Vision Mamba (ViM) and VideoMamba, which introduce bidirectional scanning, selective scanning mechanisms, and spatiotemporal processing to enhance image and video understanding. Architectural innovations like position embeddings, cross-scan modules, and hierarchical designs further optimize the Mamba framework for global and local feature extraction. These advancements position Mamba as a promising architecture in computer vision research and applications.
https://arxiv.org/abs/2502.07161
Autonomous drone navigation in dynamic environments remains a critical challenge, especially when dealing with unpredictable scenarios such as fast-moving objects with rapidly changing goal positions. While traditional planners and classical optimisation methods have been extensively used to address this dynamic problem, they often struggle with real-time, unpredictable changes, which ultimately leads to sub-optimal performance in terms of adaptiveness and real-time decision making. In this work, we propose a novel motion planner, AgilePilot, based on Deep Reinforcement Learning (DRL), trained in dynamic conditions and coupled with real-time Computer Vision (CV) for object detection during flight. The training-to-deployment framework bridges the Sim2Real gap, leveraging sophisticated reward structures that promote both safety and agility depending upon environment conditions. The system can rapidly adapt to changing environments, achieving a maximum speed of 3.0 m/s in real-world scenarios. Using velocity predictions, our approach outperforms classical algorithms such as an Artificial Potential Field (APF) based motion planner by a factor of three in both performance and tracking accuracy of dynamic targets, while exhibiting a 90% success rate across 75 experiments. This work highlights the effectiveness of DRL in tackling real-time dynamic navigation challenges, offering intelligent safety and agility.
https://arxiv.org/abs/2502.06725
Visuomotor policies trained via imitation learning are capable of performing challenging manipulation tasks, but are often extremely brittle to lighting, visual distractors, and object locations. These vulnerabilities can depend unpredictably on the specifics of training, and are challenging to expose without time-consuming and expensive hardware evaluations. We propose the problem of predictive red teaming: discovering vulnerabilities of a policy with respect to environmental factors, and predicting the corresponding performance degradation without hardware evaluations in off-nominal scenarios. In order to achieve this, we develop RoboART: an automated red teaming (ART) pipeline that (1) modifies nominal observations using generative image editing to vary different environmental factors, and (2) predicts performance under each variation using a policy-specific anomaly detector executed on edited observations. Experiments across 500+ hardware trials in twelve off-nominal conditions for visuomotor diffusion policies demonstrate that RoboART predicts performance degradation with high accuracy (less than 0.19 average difference between predicted and real success rates). We also demonstrate how predictive red teaming enables targeted data collection: fine-tuning with data collected under conditions predicted to be adverse boosts baseline performance by 2-7x.
https://arxiv.org/abs/2502.06575
Watermarking plays a key role in the provenance and detection of AI-generated content. While existing methods prioritize robustness against real-world distortions (e.g., JPEG compression and noise addition), we reveal a fundamental tradeoff: such robust watermarks inherently increase the redundancy of detectable patterns encoded into images, creating exploitable information leakage. To leverage this, we propose an attack framework that extracts leaked watermark patterns through multi-channel feature learning using a pre-trained vision model. Unlike prior works requiring massive data or detector access, our method achieves both forgery and detection evasion with a single watermarked image. Extensive experiments demonstrate that our method achieves a 60\% gain in detection-evasion success rate and a 51\% improvement in forgery accuracy compared to state-of-the-art methods while maintaining visual fidelity. Our work exposes the robustness-stealthiness paradox: current "robust" watermarks sacrifice security for distortion resistance, providing insights for future watermark design.
https://arxiv.org/abs/2502.06418
Current point-based detectors can only learn from the provided points, with limited receptive fields and insufficient global learning capability for sparse targets. In this paper, we present a novel Point Dilation Mechanism for single-stage 3D detection (PDM-SSD) that takes advantage of both point and grid representations. Specifically, we first use a PointNet-style 3D backbone for efficient feature encoding. Then, a neck with the Point Dilation Mechanism (PDM) is used to expand the feature space, which involves two key steps: point dilation and feature filling. The former expands points into a grid of a certain size centered around the sampled points in Euclidean space. The latter fills the unoccupied grid cells with features for backpropagation, using spherical harmonic coefficients and a Gaussian density function in terms of direction and scale. Next, we associate multiple dilation centers and fuse coefficients to obtain sparse grid features through height compression. Finally, we design a hybrid detection head for joint learning: on the one hand, a scene heatmap is predicted to complement the voting point set for improved detection accuracy; on the other hand, the target probability of detected boxes is calibrated through feature fusion. On the challenging Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, PDM-SSD achieves state-of-the-art results for multi-class detection among single-modal methods with an inference speed of 68 frames per second. We also demonstrate the advantages of PDM-SSD in detecting sparse and incomplete objects through numerous object-level instances. Additionally, PDM can serve as an auxiliary network to establish a connection between sampling points and object centers, thereby improving the accuracy of the model without sacrificing inference speed. Our code will be available at this https URL.
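The point-dilation step can be illustrated with a toy voxel expansion: each sampled point spawns a small block of grid cells around itself, leaving the spherical-harmonic feature filling aside. The cell size and dilation radius below are assumptions for illustration.

```python
import numpy as np

def dilate_points(points, cell=0.2, radius=1):
    """Toy point dilation: each point occupies a (2r+1)^3 block of grid cells.

    points: (N, 3) sampled point coordinates in meters.
    Feature filling (spherical harmonics, Gaussian density) is omitted.
    """
    offsets = np.stack(np.meshgrid(*[np.arange(-radius, radius + 1)] * 3,
                                   indexing="ij"), axis=-1).reshape(-1, 3)
    cells = np.floor(points / cell).astype(np.int64)            # (N, 3) cell ids
    dilated = (cells[:, None, :] + offsets[None, :, :]).reshape(-1, 3)
    return np.unique(dilated, axis=0)                           # occupied grid cells

pts = np.random.rand(1024, 3) * 40.0         # e.g. sampled LiDAR points (meters)
grid = dilate_points(pts)                    # cells to fill with features
```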
https://arxiv.org/abs/2502.07822