This paper presents RADAR (Robust Adversarial Detection via Adversarial Retraining), an approach designed to enhance the robustness of adversarial detectors against adaptive attacks while maintaining classifier performance. An adaptive attack is one in which the attacker is aware of the defenses and adapts their strategy accordingly. Our proposed method leverages adversarial training to reinforce the ability to detect attacks without compromising clean accuracy. During the training phase, we integrate adversarial examples into the dataset that are optimized to fool both the classifier and the adversarial detector, enabling the detector to learn and adapt to potential attack scenarios. Experimental evaluations on the CIFAR-10 and SVHN datasets demonstrate that our proposed algorithm significantly improves a detector's ability to accurately identify adaptive adversarial attacks without sacrificing clean accuracy.
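As a rough illustration of the adversarial-retraining idea described above, the sketch below crafts PGD-style examples that jointly target a classifier and a binary adversarial detector, and then uses them as extra detector training data. The attacker, loss combination, and module names are assumptions for illustration, not the paper's exact algorithm.

```python
# Hedged sketch: PGD-style examples that fool both classifier and detector,
# then used to retrain the detector. `classifier`/`detector` are assumed
# PyTorch modules; the detector outputs logits for {0: benign, 1: adversarial}.
import torch
import torch.nn.functional as F

def joint_pgd(x, y, classifier, detector, eps=8/255, alpha=2/255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        cls_loss = F.cross_entropy(classifier(x_adv), y)                  # fool the classifier
        det_loss = F.cross_entropy(detector(x_adv), torch.zeros_like(y))  # look benign
        grad, = torch.autograd.grad(cls_loss - det_loss, x_adv)
        x_adv = (x_adv + alpha * grad.sign()).detach()
        x_adv = torch.clamp(torch.min(torch.max(x_adv, x - eps), x + eps), 0.0, 1.0)
    return x_adv

def detector_retraining_step(x, y, classifier, detector, det_optimizer):
    """Label crafted examples as adversarial and update only the detector."""
    x_adv = joint_pgd(x, y, classifier, detector)
    inputs = torch.cat([x, x_adv])
    targets = torch.cat([torch.zeros(len(x)), torch.ones(len(x_adv))]).long().to(x.device)
    det_optimizer.zero_grad()
    F.cross_entropy(detector(inputs), targets).backward()
    det_optimizer.step()
```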
https://arxiv.org/abs/2404.12120
Foundation models, pre-trained on a large amount of data, have demonstrated impressive zero-shot capabilities in various downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks heavily reliant on extensive human annotations, foundation models such as SAM and DINO struggle to achieve satisfactory performance. In this study, we reveal that the devil is in the object boundary, \textit{i.e.}, these foundation models fail to discern boundaries between individual objects. For the first time, we find that CLIP, which has never accessed any instance-level annotations, can provide a highly beneficial and strong instance-level boundary prior in the clustering results of a particular intermediate layer. Following this surprising observation, we propose $\textbf{Zip}$, which $\textbf{Z}$ips up CL$\textbf{ip}$ and SAM in a novel classification-first-then-discovery pipeline, enabling annotation-free, complex-scene-capable, open-vocabulary object detection and instance segmentation. Our Zip significantly boosts SAM's mask AP on the COCO dataset by 12.5% and establishes state-of-the-art performance in various settings, including training-free, self-training, and label-efficient finetuning. Furthermore, annotation-free Zip even achieves performance comparable to the best-performing open-vocabulary object detectors that use base annotations. Code is released at this https URL
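The observation about CLIP's intermediate-layer clustering can be illustrated with a minimal, assumption-laden sketch: given patch tokens extracted from some intermediate CLIP layer (not specified here), a simple k-means clustering already yields a grouping map whose boundaries can serve as an instance prior for a promptable segmenter such as SAM. The layer choice, the number of clusters, and the downstream prompting are not the actual Zip pipeline.

```python
# Illustrative only: cluster intermediate-layer CLIP patch tokens to expose a rough
# instance-level grouping prior. `patch_feats` is assumed to be an (H*W, C) array.
import numpy as np
from sklearn.cluster import KMeans

def cluster_patch_tokens(patch_feats, grid_hw, k=8):
    feats = patch_feats / (np.linalg.norm(patch_feats, axis=1, keepdims=True) + 1e-8)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)
    return labels.reshape(grid_hw)  # boundaries of this map hint at object boundaries
```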
https://arxiv.org/abs/2404.11957
Interactions between humans and objects are important for recognizing object-centric actions. Existing methods usually adopt a two-stage pipeline, where object proposals are first detected using a pretrained detector and then fed to an action recognition model that extracts video features and learns object relations for action recognition. However, since the action prior is unknown in the object detection stage, important objects can easily be overlooked, leading to inferior action recognition performance. In this paper, we propose an end-to-end object-centric action recognition framework that simultaneously performs Detection And Interaction Reasoning in one stage. Specifically, after extracting video features with a base network, we create three modules for concurrent object detection and interaction reasoning. First, a Patch-based Object Decoder generates proposals from video patch tokens. Then, an Interactive Object Refining and Aggregation module identifies important objects for action recognition, adjusts proposal scores based on position and appearance, and aggregates object-level information into a global video representation. Lastly, an Object Relation Modeling module encodes object relations. These three modules, together with the video feature extractor, can be trained jointly in an end-to-end fashion, avoiding heavy reliance on an off-the-shelf object detector and reducing the multi-stage training burden. We conduct experiments on two datasets, Something-Else and Ikea-Assembly, to evaluate the performance of our proposed approach on conventional, compositional, and few-shot action recognition tasks. Through in-depth experimental analysis, we show the crucial role of interactive objects in learning for action recognition, and our method outperforms state-of-the-art methods on both datasets.
https://arxiv.org/abs/2404.11903
Autonomous driving requires an accurate representation of the environment. A strategy toward high accuracy is to fuse data from several sensors. Learned Bird's-Eye View (BEV) encoders can achieve this by mapping data from individual sensors into one joint latent space. For cost-efficient camera-only systems, this provides an effective mechanism to fuse data from multiple cameras with different views. Accuracy can be improved further by aggregating sensor information over time. This is especially important in monocular camera systems to account for the lack of explicit depth and velocity measurements. The effectiveness of BEV encoders therefore crucially depends on the operators used to aggregate temporal information and on the latent representation spaces in which aggregation is performed. We analyze BEV encoders proposed in the literature and compare their effectiveness, quantifying the effects of aggregation operators and latent representations. While most existing approaches aggregate temporal information either in image or in BEV latent space, our analyses and performance comparisons suggest that these latent representations exhibit complementary strengths. Therefore, we develop a novel temporal BEV encoder, TempBEV, which integrates aggregated temporal information from both latent spaces. We treat subsequent image frames as a stereo pair through time and leverage methods from optical flow estimation for temporal stereo encoding. Empirical evaluation on the NuScenes dataset shows a significant improvement by TempBEV over the baseline for 3D object detection and BEV segmentation. The ablation uncovers a strong synergy of joint temporal aggregation in the image and BEV latent spaces. These results indicate the overall effectiveness of our approach and make a strong case for aggregating temporal information in both image and BEV latent spaces.
https://arxiv.org/abs/2404.11803
LiDAR datasets for autonomous driving exhibit biases in properties such as point cloud density, range, and object dimensions. As a result, object detection networks trained and evaluated in different environments often experience performance degradation. Domain adaptation approaches assume access to unannotated samples from the test distribution to address this problem. However, in the real world, the exact conditions of deployment and access to samples representative of the test dataset may be unavailable while training. We argue that the more realistic and challenging formulation is to require robustness in performance to unseen target domains. We propose to address this problem in a two-pronged manner. First, we leverage paired LiDAR-image data present in most autonomous driving datasets to perform multimodal object detection. We suggest that working with multimodal features by leveraging both images and LiDAR point clouds for scene understanding tasks results in object detectors more robust to unseen domain shifts. Second, we train a 3D object detector to learn multimodal object features across different distributions and promote feature invariance across these source domains to improve generalizability to unseen target domains. To this end, we propose CLIX$^\text{3D}$, a multimodal fusion and supervised contrastive learning framework for 3D object detection that performs alignment of object features from same-class samples of different domains while pushing the features from different classes apart. We show that CLIX$^\text{3D}$ yields state-of-the-art domain generalization performance under multiple dataset shifts.
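To make the contrastive objective concrete, here is a generic supervised contrastive loss over pooled object embeddings: same-class features (regardless of source domain) are pulled together and different classes are pushed apart. This is a standard formulation shown only as a sketch; the exact CLIX$^\text{3D}$ loss may differ.

```python
# Generic supervised contrastive loss over object features from mixed source domains.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """features: (N, D) pooled object embeddings; labels: (N,) class ids."""
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask   # same class, any domain
    sim = sim.masked_fill(self_mask, float('-inf'))                   # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    per_anchor = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor[pos.sum(1) > 0].mean()                          # anchors with >=1 positive
```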
https://arxiv.org/abs/2404.11764
Popular representation learning methods encourage feature invariance under transformations applied at the input. However, in 3D perception tasks like object localization and segmentation, outputs are naturally equivariant to some transformations, such as rotation. Using pre-training loss functions that encourage equivariance of features under certain transformations provides a strong self-supervision signal while also retaining information about the geometric relationships between transformed feature representations. This can enable improved performance in downstream tasks that are equivariant to such transformations. In this paper, we propose a spatio-temporal equivariant learning framework that considers spatial and temporal augmentations jointly. Our experiments show that the best performance arises with a pre-training approach that encourages equivariance to translation, scaling, flipping, rotation, and scene flow. For spatial augmentations, we find that, depending on the transformation, either a contrastive objective or an equivariance-by-classification objective yields the best results. To leverage real-world object deformations and motion, we consider sequential LiDAR scene pairs and develop a novel 3D scene flow-based equivariance objective that leads to improved performance overall. We evaluate our pre-training method on 3D object detection, where it outperforms existing equivariant and invariant approaches in many settings.
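As a worked example of what "equivariance of features" means as a pre-training loss, the sketch below penalizes the mismatch between encoding a rotated point cloud and rotating the encoded features, assuming point-wise 3D (vector-valued) features so that rotating the features is well defined. The paper's actual objectives, including the equivariance-by-classification and scene-flow variants, are not reproduced here.

```python
# Minimal rotation-equivariance loss: encode(rotate(x)) should match rotate(encode(x)).
import torch
import torch.nn.functional as F

def rotation_equivariance_loss(encoder, points, rot):
    """points: (B, N, 3); rot: (3, 3) rotation matrix; encoder: (B, N, 3) -> (B, N, 3)."""
    feats_of_rotated = encoder(points @ rot.t())
    rotated_feats = encoder(points) @ rot.t()
    return F.mse_loss(feats_of_rotated, rotated_feats)
```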
https://arxiv.org/abs/2404.11737
This paper introduces Multi-Resolution Rescored Byte-Track (MR2-ByteTrack), a novel video object detection framework for ultra-low-power embedded processors. This method reduces the average compute load of an off-the-shelf Deep Neural Network (DNN) based object detector by up to 2.25$\times$ by alternating the processing of high-resolution images (320$\times$320 pixels) with multiple down-sized frames (192$\times$192 pixels). To tackle the accuracy degradation due to the reduced image input size, MR2-ByteTrack correlates the output detections over time using the ByteTrack tracker and corrects potential misclassification using a novel probabilistic Rescore algorithm. By interleaving two down-sized images for every high-resolution one as the input of different state-of-the-art DNN object detectors with our MR2-ByteTrack, we demonstrate an average accuracy increase of 2.16% and a latency reduction of 43% on the GAP9 microcontroller compared to a baseline frame-by-frame inference scheme using exclusively full-resolution images. Code available at: this https URL
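The interleaving schedule and per-frame flow can be sketched as below; `detector`, `tracker.update`, and `rescore` stand in for the DNN detector, the ByteTrack association step, and the probabilistic Rescore step, and are assumptions rather than the released implementation.

```python
# Sketch of the MR2-style schedule: one 320x320 frame, then two 192x192 frames, repeated.
from itertools import cycle
import cv2

RESOLUTIONS = cycle([(320, 320), (192, 192), (192, 192)])

def process_stream(frames, detector, tracker, rescore):
    tracks = []
    for frame, size in zip(frames, RESOLUTIONS):
        small = cv2.resize(frame, size)       # down-sized frames cost much less compute
        detections = detector(small)          # off-the-shelf detector
        tracks = tracker.update(detections)   # temporal association (ByteTrack-style)
        tracks = rescore(tracks)              # correct likely misclassifications over time
    return tracks
```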
https://arxiv.org/abs/2404.11488
As large language models (LLMs) become increasingly commonplace, concern about distinguishing between human and AI text increases as well. The growing power of these models is of particular concern to teachers, who may worry that students will use LLMs to write school assignments. Facing a technology with which they are unfamiliar, teachers may turn to publicly available AI text detectors. Yet the accuracy of many of these detectors has not been thoroughly verified, posing potential harm to students who are falsely accused of academic dishonesty. In this paper, we evaluate three different AI text detectors (Kirchenbauer et al.'s watermarking, ZeroGPT, and GPTZero) against human- and AI-generated essays. We find that watermarking results in a high false positive rate, and that ZeroGPT has both high false positive and high false negative rates. Further, we are able to significantly increase the false negative rate of all detectors by using ChatGPT 3.5 to paraphrase the original AI-generated texts, thereby effectively bypassing the detectors.
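For readers unfamiliar with the two error rates discussed here, the snippet below computes them for any detector's binary decisions; the label convention is illustrative.

```python
# False positive rate: human-written essays flagged as AI.
# False negative rate: AI-generated essays that slip through.
def error_rates(y_true, y_pred):
    """0 = human-written, 1 = AI-generated, for both lists."""
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    neg = sum(t == 0 for t in y_true)
    pos = sum(t == 1 for t in y_true)
    return fp / max(neg, 1), fn / max(pos, 1)

# e.g. error_rates([0, 0, 1, 1], [1, 0, 0, 1]) -> (0.5, 0.5)
```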
https://arxiv.org/abs/2404.11408
Object detection tasks, crucial in safety-critical systems like autonomous driving, focus on pinpointing object locations. These detectors are known to be susceptible to backdoor attacks. However, existing backdoor techniques have primarily been adapted from classification tasks, overlooking deeper vulnerabilities specific to object detection. This paper is dedicated to bridging this gap by introducing Detector Collapse (DC), a brand-new backdoor attack paradigm tailored for object detection. DC is designed to instantly incapacitate detectors (i.e., severely impairing the detector's performance and culminating in a denial-of-service). To this end, we develop two innovative attack schemes: Sponge for triggering widespread misidentifications and Blinding for rendering objects invisible. Remarkably, we introduce a novel poisoning strategy exploiting natural objects, enabling DC to act as a practical backdoor in real-world environments. Our experiments on different detectors across several benchmarks show a significant improvement ($\sim$10\%-60\% absolute and $\sim$2-7$\times$ relative) in attack efficacy over state-of-the-art attacks.
https://arxiv.org/abs/2404.11357
Recently, backdoor attacks have posed a serious security threat to the training process of deep neural networks (DNNs). The attacked model behaves normally on benign samples but outputs a specific result when the trigger is present. However, compared with the rapid progress of backdoor attacks, existing defenses struggle to deal with these threats effectively or require benign samples to work, which may be unavailable in real scenarios. In this paper, we find that poisoned samples and benign samples can be distinguished by prediction entropy. This inspires us to propose a novel dual-network training framework, The Victim and The Beneficiary (V&B), which exploits a poisoned model to train a clean model without extra benign samples. First, we sacrifice the Victim network, training it on suspicious samples so that it becomes a powerful poisoned-sample detector. Second, we train the Beneficiary network on the credible samples selected by the Victim to inhibit backdoor injection. Third, a semi-supervised suppression strategy is adopted to erase potential backdoors and improve model performance. Furthermore, to better suppress missed poisoned samples, we propose a strong data augmentation method, AttentionMix, which works well with our proposed V&B framework. Extensive experiments on two widely used datasets against 6 state-of-the-art attacks demonstrate that our framework is effective in preventing backdoor injection and robust to various attacks while maintaining performance on benign samples. Our code is available at this https URL.
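The prediction-entropy observation can be sketched as follows: under a possibly poisoned model, compute the entropy of each training sample's predicted distribution and split the set into suspicious and credible parts. Treating low-entropy samples as suspicious and using a fixed threshold are illustrative assumptions here, not the exact V&B procedure.

```python
# Split training samples by prediction entropy under a possibly poisoned model.
import torch
import torch.nn.functional as F

@torch.no_grad()
def split_by_entropy(model, loader, threshold=0.5):
    suspicious, credible = [], []
    model.eval()
    for x, y in loader:
        probs = F.softmax(model(x), dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
        for xi, yi, h in zip(x, y, entropy):
            (suspicious if h.item() < threshold else credible).append((xi, yi))
    return suspicious, credible
```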
https://arxiv.org/abs/2404.11265
Motivated by the need to improve model performance in traffic monitoring tasks with limited labeled samples, we propose a straightforward augmentation technique tailored for object detection datasets, specifically designed for stationary camera-based applications. Our approach places augmented objects in the same positions as the originals, which is key to its effectiveness. By applying in-place augmentation to objects from the same camera input image, we address the challenge of overlap with the original and previously selected objects. Through extensive testing on two traffic monitoring datasets, we illustrate the efficacy of our augmentation strategy in improving model performance, particularly in scenarios with limited labeled samples and imbalanced class distributions. Notably, our method achieves performance comparable to models trained on the entire dataset while utilizing only 8.5 percent of the original data. Moreover, we report significant improvements, with mAP@.5 increasing from 0.4798 to 0.5025 and mAP@.5:.95 rising from 0.29 to 0.3138 on the FishEye8K dataset. These results highlight the potential of our augmentation approach in enhancing object detection models for traffic monitoring applications.
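A minimal sketch of the in-place idea, under the assumption of axis-aligned pixel boxes: an object crop from one frame of a stationary camera is pasted at the same coordinates into another frame from that camera, and the paste is skipped when it would overlap existing or previously pasted boxes.

```python
# In-place copy-paste augmentation for a stationary camera (illustrative assumptions).
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / max(area(a) + area(b) - inter, 1e-9)

def paste_in_place(target_img, target_boxes, source_img, source_box, max_iou=0.05):
    """Paste the object pixels of `source_box` into `target_img` at the same position.
    Images are (H, W, 3) arrays; boxes are (x1, y1, x2, y2) in pixels."""
    if any(box_iou(source_box, b) > max_iou for b in target_boxes):
        return target_img, target_boxes                      # would overlap: skip this object
    x1, y1, x2, y2 = map(int, source_box)
    out = target_img.copy()
    out[y1:y2, x1:x2] = source_img[y1:y2, x1:x2]             # same position as the original
    return out, target_boxes + [list(source_box)]
```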
https://arxiv.org/abs/2404.11226
A significant challenge in the field of object detection lies in the system's performance under non-ideal imaging conditions, such as rain, fog, low illumination, or raw Bayer images that lack ISP processing. Our study introduces "Feature Corrective Transfer Learning", a novel approach that leverages transfer learning and a bespoke loss function to facilitate end-to-end detection of objects in these challenging scenarios without the need to convert non-ideal images into their RGB counterparts. In our methodology, we initially train a comprehensive model on a pristine RGB image dataset. Subsequently, non-ideal images are processed by comparing their feature maps against those from the initial ideal RGB model. This comparison employs the Extended Area Novel Structural Discrepancy Loss (EANSDL), a novel loss function designed to quantify these similarities and integrate them into the detection loss. This approach refines the model's ability to perform object detection across varying conditions through direct feature-map correction, encapsulating the essence of Feature Corrective Transfer Learning. Experimental validation on variants of the KITTI dataset demonstrates a significant improvement in mean Average Precision (mAP): a 3.8-8.1% relative enhancement in detection under non-ideal conditions compared to the baseline model, and a gap of no more than 1.3% from the mAP@[0.5:0.95] achieved under ideal conditions by the standard Faster R-CNN algorithm.
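The abstract does not spell out EANSDL, so the following is only a stand-in that conveys the shape of the training signal: a discrepancy term between the non-ideal image's backbone feature maps and those of the frozen ideal-RGB model, added to the usual detection loss.

```python
# Stand-in feature-correction term (not the actual EANSDL formulation).
import torch.nn.functional as F

def feature_discrepancy_loss(nonideal_feats, reference_feats, weights=None):
    """Both arguments: lists of same-shaped feature maps from corresponding backbone stages."""
    weights = weights or [1.0] * len(nonideal_feats)
    return sum(w * F.mse_loss(f, r.detach())
               for w, f, r in zip(weights, nonideal_feats, reference_feats))

# total_loss = detection_loss + lam * feature_discrepancy_loss(feats_raw, feats_rgb)
```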
https://arxiv.org/abs/2404.11214
Compact neural networks are specially designed for applications on edge devices, offering faster inference speed at modest performance. However, training strategies for compact models are currently borrowed from those of conventional models, which ignores the difference in model capacity and may thus impede the performance of compact models. In this paper, by systematically investigating the impact of different training ingredients, we introduce a strong training strategy for compact models. We find that appropriate designs of re-parameterization and knowledge distillation are crucial for training high-performance compact models, while some data augmentations commonly used for training conventional models, such as Mixup and CutMix, lead to worse performance. Our experiments on the ImageNet-1K dataset demonstrate that our specialized training strategy for compact models is applicable to various architectures, including GhostNetV2, MobileNetV2, and ShuffleNetV2. Specifically, equipped with our strategy, GhostNetV3 1.3$\times$ achieves a top-1 accuracy of 79.1% with only 269M FLOPs and a latency of 14.46 ms on mobile devices, surpassing its ordinarily trained counterpart by a large margin. Moreover, our observations also extend to object detection scenarios. PyTorch code and checkpoints can be found at this https URL.
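Of the ingredients named above, knowledge distillation has a standard form worth recalling; the sketch below is the common Hinton-style loss, not necessarily the exact recipe used for GhostNetV3.

```python
# Standard soft-label knowledge distillation combined with the hard-label loss.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1 - alpha) * soft
```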
https://arxiv.org/abs/2404.11202
3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications, which require both shared and complementary information in localization and visual-language relationships. Therefore, existing approaches adopt the two-stage "detect-then-describe/discriminate" pipeline, which relies heavily on the performance of the detector, resulting in suboptimal performance. Inspired by DETR, we propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks in an end-to-end fashion. The key idea is to reconsider the prompt-based localization ability of the 3DVG model. In this way, the 3DVG model with a well-designed prompt as input can assist the 3DDC task by extracting localization information from the prompt. In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection, effectively harnessing the existing 3DVG model's inherent localization capacity, thereby boosting 3DDC capability. This integration facilitates simultaneous multi-task training on both tasks, mutually enhancing their performance. Extensive experimental results demonstrate the effectiveness of this approach. Specifically, on the ScanRefer dataset, 3DGCTR surpasses the state-of-the-art 3DDC method by 4.3% in CIDEr@0.5IoU in MLE training and improves upon the SOTA 3DVG method by 3.16% in Acc@0.25IoU.
https://arxiv.org/abs/2404.11064
Vision sensors are versatile and can capture a wide range of visual cues, such as color, texture, shape, and depth. This versatility, along with the relatively inexpensive availability of machine vision cameras, has played an important role in the adoption of vision-based environment perception systems in autonomous vehicles (AVs). However, vision-based perception systems can be easily affected by glare in the presence of a bright source of light, such as the sun, the headlights of an oncoming vehicle at night, or simply light reflecting off snow- or ice-covered surfaces; such scenarios are encountered frequently during driving. In this paper, we investigate various glare reduction techniques, including the proposed saturated pixel-aware glare reduction technique, for improved performance of the computer vision (CV) tasks employed by the perception layer of AVs. We evaluate these glare reduction methods using various performance metrics of the CV algorithms employed by the perception layer. Specifically, we consider object detection, object recognition, object tracking, depth estimation, and lane detection, which are crucial for autonomous driving. The experimental findings validate the efficacy of the proposed glare reduction approach, showcasing enhanced performance across diverse perception tasks and remarkable resilience against varying levels of glare.
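As a rough illustration of what "saturated pixel-aware" could involve, the snippet below flags pixels at or near sensor saturation so that a subsequent glare-reduction step can treat them separately; the threshold and the decision rule are assumptions, not the paper's method.

```python
# Flag near-saturated pixels as glare candidates (illustrative threshold).
import numpy as np

def saturated_pixel_mask(image_uint8, threshold=250, min_pixels=25):
    """image_uint8: (H, W, 3) uint8 image. Returns a boolean glare-candidate mask."""
    mask = (image_uint8 >= threshold).all(axis=2)
    if mask.sum() < min_pixels:          # too few saturated pixels: likely just noise
        return np.zeros_like(mask)
    return mask
```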
https://arxiv.org/abs/2404.10992
Methods that can generate synthetic speech perceptually indistinguishable from speech recorded by a human speaker are easily available. Several incidents report misuse of synthetic speech generated with these methods to commit fraud. To counter such misuse, many methods have been proposed to detect synthetic speech. Some of these detectors are more interpretable, can generalize to detecting synthetic speech in the wild, and are robust to noise. However, limited work has been done on understanding bias in these detectors. In this work, we examine bias in existing synthetic speech detectors to determine whether they unfairly target a particular gender, age, or accent group. We also inspect whether these detectors have a higher misclassification rate for bona fide speech from speech-impaired speakers relative to fluent speakers. Extensive experiments on 6 existing synthetic speech detectors using more than 0.9 million speech signals demonstrate that most detectors are gender, age, and accent biased, and future work is needed to ensure fairness. To support future research, we release our evaluation dataset, the models used in our study, and source code at this https URL.
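The bias analysis boils down to comparing error rates across demographic groups; a minimal version of that bookkeeping is sketched below with illustrative field names.

```python
# Per-group error rates for a synthetic-speech detector (field names are illustrative).
from collections import defaultdict

def per_group_error_rates(records):
    """records: iterable of dicts like {'group': 'accent=...', 'label': 0/1, 'pred': 0/1}."""
    errors, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r['group']] += 1
        errors[r['group']] += int(r['pred'] != r['label'])
    return {g: errors[g] / totals[g] for g in totals}

# Large gaps between groups point to the gender/age/accent bias reported above.
```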
https://arxiv.org/abs/2404.10989
The integration of Light Detection and Ranging (LiDAR) and Internet of Things (IoT) technologies offers transformative opportunities for public health informatics in urban safety and pedestrian well-being. This paper proposes a novel framework utilizing these technologies for enhanced 3D object detection and activity classification in urban traffic scenarios. By employing elevated LiDAR, we obtain detailed 3D point cloud data, enabling precise pedestrian activity monitoring. To overcome urban data scarcity, we create a specialized dataset through simulated traffic environments in Blender, facilitating targeted model training. Our approach employs a modified Point Voxel-Region-based Convolutional Neural Network (PV-RCNN) for robust 3D detection and PointNet for classifying pedestrian activities. This dual-model approach not only enhances urban traffic management but also contributes significantly to public health by providing insights into pedestrian behavior and promoting safer urban environments.
https://arxiv.org/abs/2404.10978
An object detector's ability to detect and flag \textit{novel} objects during open-world deployments is critical for many real-world applications. Unfortunately, much of the work in open object detection today is disjointed and fails to adequately address applications that prioritize unknown object recall \textit{in addition to} known-class accuracy. To close this gap, we present a new task called Open-Set Object Detection and Discovery (OSODD) and as a solution propose the Open-Set Regions with ViT features (OSR-ViT) detection framework. OSR-ViT combines a class-agnostic proposal network with a powerful ViT-based classifier. Its modular design simplifies optimization and allows users to easily swap proposal solutions and feature extractors to best suit their application. Using our multifaceted evaluation protocol, we show that OSR-ViT obtains performance levels that far exceed state-of-the-art supervised methods. Our method also excels in low-data settings, outperforming supervised baselines using a fraction of the training data.
https://arxiv.org/abs/2404.10865
The aim of this work is to establish how accurately a recent semantic-based foveal active perception model is able to complete visual tasks that are regularly performed by humans, namely scene exploration and visual search. This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations. It has previously been used in scene exploration tasks. In this paper, we revisit the model and extend its application to visual search tasks. To illustrate the benefits of using semantic information in scene exploration and visual search tasks, we compare its performance against traditional saliency-based models. In the scene exploration task, the semantic-based method demonstrates superior performance to the traditional saliency-based model in accurately representing the semantic information present in the visual scene. In visual search experiments, where instances of a target class are sought in a visual field containing multiple distractors, it likewise outperforms the saliency-driven model and a random gaze selection algorithm. Our results demonstrate that top-down semantic information significantly influences visual exploration and search tasks, suggesting a potential area of research for integrating it with traditional bottom-up cues.
https://arxiv.org/abs/2404.10836
Anomaly detection (AD) is often focused on detecting anomalous areas for industrial quality inspection and medical lesion examination. However, due to the specific scenario targets, the data scale for AD is relatively small, and evaluation metrics are still deficient compared to classic vision tasks such as object detection and semantic segmentation. To fill these gaps, this work first constructs a large-scale and general-purpose COCO-AD dataset by extending COCO to the AD field. This enables fair evaluation and sustainable development for different methods on this challenging benchmark. Moreover, current metrics such as AU-ROC have nearly reached saturation on simple datasets, which prevents a comprehensive evaluation of different methods. Inspired by the metrics used in the segmentation field, we further propose several more practical threshold-dependent AD-specific metrics, i.e., m$F_1{}^{.2}_{.8}$, mAcc$^{.2}_{.8}$, mIoU$^{.2}_{.8}$, and mIoU-max. Motivated by GAN inversion's high-quality reconstruction capability, we propose a simple but more powerful InvAD framework to achieve high-quality feature reconstruction. Our method improves the effectiveness of reconstruction-based methods on the popular MVTec AD and VisA datasets and on our newly proposed COCO-AD dataset under a multi-class unsupervised setting, where only a single detection model is trained to detect anomalies from different classes. Extensive ablation experiments have demonstrated the effectiveness of each component of InvAD. Full code and models are available at this https URL.
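To make the threshold-dependent metrics concrete, a plausible reading of m$F_1{}^{.2}_{.8}$ is the F1 score of the binarized anomaly map averaged over decision thresholds from 0.2 to 0.8; the sketch below implements that reading, while the paper's exact threshold grid and averaging details may differ.

```python
# F1 averaged over decision thresholds 0.2..0.8 (step assumed to be 0.1).
import numpy as np

def mean_f1_over_thresholds(scores, gt_mask, thresholds=np.arange(0.2, 0.81, 0.1)):
    """scores: per-pixel anomaly scores in [0, 1]; gt_mask: binary ground-truth mask."""
    gt = gt_mask.astype(bool)
    f1s = []
    for t in thresholds:
        pred = scores >= t
        tp = np.logical_and(pred, gt).sum()
        fp = np.logical_and(pred, ~gt).sum()
        fn = np.logical_and(~pred, gt).sum()
        f1s.append(2 * tp / max(2 * tp + fp + fn, 1))
    return float(np.mean(f1s))
```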
https://arxiv.org/abs/2404.10760