The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and struggles to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training, while simultaneously saving memory and computation. Secondly, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourages vision-text feature alignment and guarantees it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs, thus improving detection performance on classes with no human-annotated bounding boxes. Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves the new state-of-the-art on the mAP-all metric and demonstrates competitive performance for mAP-rare, as well as superior transfer to COCO and Objects365.
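To make the gated-shortcut idea above concrete, here is a minimal PyTorch sketch (not the paper's implementation): a trainable branch is blended in through a zero-initialized gate, so the module is exactly the identity on the pretrained features when detection training starts, preserving the vision-text alignment.

```python
import torch
import torch.nn as nn

class GatedShortcut(nn.Module):
    """Wraps a trainable branch with a zero-initialized gate so that, at the
    start of training, the module reduces to the identity on the pretrained
    features (illustrative sketch, not the paper's implementation)."""
    def __init__(self, channels: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> pure shortcut

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.tanh(self.gate) * self.branch(x)

feats = torch.randn(2, 256, 32, 32)
out = GatedShortcut(256)(feats)
assert torch.allclose(out, feats)  # alignment is preserved at initialization
```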
https://arxiv.org/abs/2303.13518
This paper introduces the Masked Voxel Jigsaw and Reconstruction (MV-JAR) method for LiDAR-based self-supervised pre-training and a carefully designed data-efficient 3D object detection benchmark on the Waymo dataset. Inspired by the scene-voxel-point hierarchy in downstream 3D object detectors, we design masking and reconstruction strategies accounting for voxel distributions in the scene and local point distributions within the voxel. We employ a Reversed-Furthest-Voxel-Sampling strategy to address the uneven distribution of LiDAR points and propose MV-JAR, which combines two techniques for modeling the aforementioned distributions, resulting in superior performance. Our experiments reveal limitations in previous data-efficient experiments, which uniformly sample fine-tuning splits with varying data proportions from each LiDAR sequence, leading to similar data diversity across splits. To address this, we propose a new benchmark that samples scene sequences for diverse fine-tuning splits, ensuring adequate model convergence and providing a more accurate evaluation of pre-training methods. Experiments on our Waymo benchmark and the KITTI dataset demonstrate that MV-JAR consistently and significantly improves 3D detection performance across various data scales, achieving up to a 6.3% increase in mAPH compared to training from scratch. Codes and the benchmark will be available at this https URL .
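A rough numpy sketch of the voxel-level masking step is below; the jigsaw/reconstruction targets and the Reversed-Furthest-Voxel-Sampling rule from the paper are not reproduced, and the voxel size and mask ratio are arbitrary illustrative values.

```python
import numpy as np

def mask_voxels(points, voxel_size=0.5, mask_ratio=0.6, seed=0):
    """Group LiDAR points into voxels and drop a random subset of voxels.

    Illustrative only: MV-JAR additionally chooses which voxels to mask with a
    Reversed-Furthest-Voxel-Sampling strategy and defines jigsaw/reconstruction
    targets per voxel, which are omitted here.
    """
    rng = np.random.default_rng(seed)
    voxel_idx = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(voxel_idx, axis=0, return_inverse=True)
    n_masked = int(mask_ratio * len(uniq))
    masked_voxels = rng.choice(len(uniq), size=n_masked, replace=False)
    is_masked = np.isin(inverse, masked_voxels)
    return points[~is_masked], points[is_masked]  # visible points, masked targets

pts = np.random.rand(10000, 4) * 50.0  # x, y, z, intensity
visible, targets = mask_voxels(pts)
```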
https://arxiv.org/abs/2303.13510
DEtection TRansformer (DETR) started a trend that uses a group of learnable queries for unified visual perception. This work begins by applying this appealing paradigm to LiDAR-based point cloud segmentation and obtains a simple yet effective baseline. Although the naive adaptation obtains fair results, the instance segmentation performance is noticeably inferior to previous works. By diving into the details, we observe that instances in the sparse point clouds are relatively small compared to the whole scene and often have similar geometry but lack distinctive appearance for segmentation, which is rare in the image domain. Considering that instances in 3D are characterized more by their positional information, we emphasize their roles during the modeling and design a robust Mixed-parameterized Positional Embedding (MPE) to guide the segmentation process. It is embedded into backbone features and later guides the mask prediction and query update processes iteratively, leading to Position-Aware Segmentation (PA-Seg) and Masked Focal Attention (MFA). All these designs impel the queries to attend to specific regions and identify various instances. The method, named Position-guided Point cloud Panoptic segmentation transFormer (P3Former), outperforms previous state-of-the-art methods by 3.4% and 1.2% PQ on the SemanticKITTI and nuScenes benchmarks, respectively. The source code and models are available at this https URL.
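The abstract does not spell out the MPE parameterization, so the sketch below only illustrates the general idea of a learnable position embedding that mixes Cartesian and polar parameterizations of point coordinates; the actual P3Former design may differ.

```python
import torch
import torch.nn as nn

class MixedPositionalEmbedding(nn.Module):
    """Sketch of a position embedding that mixes Cartesian and polar
    parameterizations of point coordinates (the actual MPE design in P3Former
    may differ)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        x, y, z = xyz.unbind(-1)
        rho = torch.sqrt(x ** 2 + y ** 2)          # radial distance in the ground plane
        phi = torch.atan2(y, x)                    # azimuth angle
        mixed = torch.stack([x, y, z, rho, torch.sin(phi), torch.cos(phi)], dim=-1)
        return self.mlp(mixed)                     # added to backbone features / queries

pe = MixedPositionalEmbedding()(torch.randn(1, 1024, 3))  # (1, 1024, 256)
```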
https://arxiv.org/abs/2303.13509
This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters), and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.3%), 1-shot ImageNet-1k (62.1%), and zero-shot transfer on Food-101 (96.0%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images.
https://arxiv.org/abs/2303.13496
In TV services, dialogue level personalization is key to meeting user preferences and needs. When dialogue and background sounds are not separately available from the production stage, Dialogue Separation (DS) can estimate them to enable personalization. DS was shown to provide clear benefits for the end user. Still, the estimated signals are not perfect, and some leakage can be introduced. This is undesired, especially during passages without dialogue. We propose to combine DS and Voice Activity Detection (VAD), both recently proposed for TV audio. When their combination suggests dialogue inactivity, background components leaking in the dialogue estimate are reassigned to the background estimate. A clear improvement of the audio quality is shown for dialogue-free signals, without performance drops when dialogue is active. A post-processed VAD estimate with improved detection accuracy is also generated. It is concluded that DS and VAD can improve each other and are better used together.
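A simplified sketch of the described combination: whenever the (post-processed) VAD decision marks a frame as dialogue-free, whatever the dialogue separator produced for that frame is treated as leakage and moved back into the background estimate. The frame length and the hard frame-wise decision are illustrative assumptions.

```python
import numpy as np

def reassign_leakage(dialog_est, background_est, vad_active, frame_len=1024):
    """When VAD marks a frame as dialogue-free, move the dialogue estimate
    (assumed to be leaked background) into the background estimate.

    Sketch under simplifying assumptions: hard frame-wise decisions and
    non-overlapping frames.
    """
    dialog_out = dialog_est.copy()
    background_out = background_est.copy()
    for i, active in enumerate(vad_active):
        sl = slice(i * frame_len, (i + 1) * frame_len)
        if not active:
            background_out[sl] += dialog_out[sl]
            dialog_out[sl] = 0.0
    return dialog_out, background_out

d = np.random.randn(10 * 1024)                 # dialogue estimate from DS
b = np.random.randn(10 * 1024)                 # background estimate from DS
vad = np.array([1, 1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=bool)
d_clean, b_clean = reassign_leakage(d, b, vad)
```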
https://arxiv.org/abs/2303.13453
To detect the deployment of large language models for malicious use cases (e.g., fake content creation or academic plagiarism), several approaches have recently been proposed for identifying AI-generated text via watermarks or statistical irregularities. How robust are these detection algorithms to paraphrases of AI-generated text? To stress test these detectors, we first train an 11B parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, optionally leveraging surrounding text (e.g., user-written prompts) as context. DIPPER also uses scalar knobs to control the amount of lexical diversity and reordering in the paraphrases. Paraphrasing text generated by three large language models (including GPT3.5-davinci-003) with DIPPER successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI's text classifier. For example, DIPPER drops the detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings, while only classifying 1% of human-written sequences as AI-generated. We will open source our code, model and data for future research.
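The retrieval defense can be sketched as a nearest-neighbour search over embeddings of previously served generations; the encoder, database size, and threshold below are placeholders, and a production system would use an approximate-nearest-neighbour index rather than a brute-force scan.

```python
import numpy as np

def retrieval_detect(candidate_emb, generation_db, threshold=0.75):
    """Flag a candidate text as AI-generated if its embedding matches any
    previously generated sequence above a cosine-similarity threshold.

    Sketch only: embeddings are assumed to come from some sentence encoder,
    and the threshold is a placeholder.
    """
    db = generation_db / np.linalg.norm(generation_db, axis=1, keepdims=True)
    q = candidate_emb / np.linalg.norm(candidate_emb)
    sims = db @ q
    best = int(np.argmax(sims))
    return bool(sims[best] >= threshold), best, float(sims[best])

db = np.random.randn(15000, 384)                  # stored API generations (toy size)
query = db[42] + 0.05 * np.random.randn(384)      # a light paraphrase of entry 42
is_ai, match_idx, score = retrieval_detect(query, db)
```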
https://arxiv.org/abs/2303.13408
Instrument playing technique (IPT) is a key element of musical presentation. However, most of the existing works for IPT detection only concern monophonic music signals, yet little has been done to detect IPTs in polyphonic instrumental solo pieces with overlapping IPTs or mixed IPTs. In this paper, we formulate it as a frame-level multi-label classification problem and apply it to Guzheng, a Chinese plucked string instrument. We create a new dataset, Guzheng_Tech99, containing Guzheng recordings and onset, offset, pitch, and IPT annotations for each note. Because different IPTs vary a lot in their lengths, we propose a new method to solve this problem using a multi-scale network and self-attention. The multi-scale network extracts features from different scales, and the self-attention mechanism applied to the feature maps at the coarsest scale further enhances the long-range feature extraction. Our approach outperforms existing works by a large margin, indicating its effectiveness in IPT detection.
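A toy PyTorch stand-in for the described architecture is shown below: parallel convolutional branches with different receptive fields and strides provide multi-scale features, self-attention is applied only at the coarsest scale, and a sigmoid head yields frame-level multi-label IPT predictions. Layer sizes are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiScaleIPTNet(nn.Module):
    """Frame-level multi-label classifier sketch: convolutional branches at two
    time scales, with self-attention applied at the coarsest scale only."""
    def __init__(self, n_bins=128, n_classes=7, dim=64):
        super().__init__()
        self.fine = nn.Conv1d(n_bins, dim, kernel_size=3, padding=1)
        self.coarse = nn.Conv1d(n_bins, dim, kernel_size=9, stride=4, padding=4)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, spec):                     # spec: (B, n_bins, T)
        f = self.fine(spec)                      # (B, dim, T)
        c = self.coarse(spec)                    # (B, dim, T/4)
        c, _ = self.attn(c.transpose(1, 2), c.transpose(1, 2), c.transpose(1, 2))
        c = nn.functional.interpolate(c.transpose(1, 2), size=f.shape[-1])
        logits = self.head(torch.cat([f, c], dim=1).transpose(1, 2))
        return torch.sigmoid(logits)             # (B, T, n_classes), multi-label

probs = MultiScaleIPTNet()(torch.randn(2, 128, 400))
```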
https://arxiv.org/abs/2303.13272
Universal anomaly detection still remains a challenging problem in machine learning and medical image analysis. It is possible to learn an expected distribution from a single class of normative samples, e.g., through epistemic uncertainty estimates, auto-encoding models, or from synthetic anomalies in a self-supervised way. The performance of self-supervised anomaly detection approaches is still inferior compared to methods that use examples from known unknown classes to shape the decision boundary. However, outlier exposure methods often do not identify unknown unknowns. Here we discuss an improved self-supervised single-class training strategy that supports the approximation of probabilistic inference with loosened feature locality constraints. We show that up-scaling of gradients with histogram-equalised images is beneficial for recently proposed self-supervision tasks. Our method is integrated into several out-of-distribution (OOD) detection models and we show evidence that our method outperforms the state-of-the-art on various benchmark datasets. Source code will be publicly available by the time of the conference.
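Only the histogram-equalisation step is sketched here (plain numpy); how the equalised images are then used to up-scale gradients in the self-supervision tasks follows the paper and is not shown.

```python
import numpy as np

def histogram_equalise(img_u8):
    """Classic histogram equalisation for an 8-bit grayscale image; the paper
    uses such equalised images when up-scaling gradients for its self-supervised
    tasks (this sketch only shows the equalisation step)."""
    hist = np.bincount(img_u8.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1)   # normalise to [0, 1]
    lut = np.round(255 * cdf).astype(np.uint8)                # intensity mapping table
    return lut[img_u8]

img = np.random.randint(0, 256, size=(256, 256), dtype=np.uint8)
eq = histogram_equalise(img)
```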
https://arxiv.org/abs/2303.13227
Few-shot object detection (FSOD) aims to expand an object detector for novel categories given only a few instances for training. The few training samples restrict the performance of the FSOD model. Recent text-to-image generation models have shown promising results in generating high-quality images. How applicable these synthetic images are for FSOD tasks remains under-explored. This work extensively studies how synthetic images generated from state-of-the-art text-to-image generators benefit FSOD tasks. We focus on two perspectives: (1) How to use synthetic data for FSOD? (2) How to find representative samples from the large-scale synthetic dataset? We design a copy-paste-based pipeline for using synthetic data. Specifically, salient object detection is applied to the original generated image, and the minimum enclosing box is used for cropping the main object based on the saliency map. After that, the cropped object is randomly pasted onto an image from the base dataset. We also study the influence of the input text of the text-to-image generator and the number of synthetic images used. To construct a representative synthetic training dataset, we maximize the diversity of the selected images via a sample-based and cluster-based method. However, the severe problem of a high false positive (FP) ratio for novel categories in FSOD cannot be solved by synthetic data alone. We propose integrating CLIP, a zero-shot recognition model, into the FSOD pipeline, which can filter out 90% of FPs by defining a threshold for the similarity score between the detected object and the text of the predicted category. Extensive experiments on PASCAL VOC and MS COCO validate the effectiveness of our method, with performance gains of up to 21.9% compared to the few-shot baseline.
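A compact sketch of the copy-paste step is given below, assuming a saliency mask is already available from a salient-object detector; the CLIP-based false-positive filter and the diversity-based sample selection are separate components not shown here.

```python
import numpy as np

def copy_paste(synthetic_img, saliency_mask, base_img, rng=None):
    """Crop the main object from a generated image using the minimum box that
    encloses its saliency mask, then paste it at a random position on a base-set
    image (a simplified version of the pipeline described above)."""
    if rng is None:
        rng = np.random.default_rng()
    ys, xs = np.nonzero(saliency_mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop = synthetic_img[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    H, W = base_img.shape[:2]
    top = rng.integers(0, H - h + 1)
    left = rng.integers(0, W - w + 1)
    out = base_img.copy()
    out[top:top + h, left:left + w] = crop
    box = (left, top, left + w, top + h)        # pseudo ground-truth box
    return out, box

gen = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
mask = np.zeros((512, 512), dtype=bool); mask[100:300, 150:350] = True
base = np.random.randint(0, 256, (800, 800, 3), dtype=np.uint8)
pasted, bbox = copy_paste(gen, mask, base)
```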
https://arxiv.org/abs/2303.13221
Knowledge distillation is a popular technique for transferring the knowledge from a large teacher model to a smaller student model by mimicking. However, distillation by directly aligning the feature maps between teacher and student may enforce overly strict constraints on the student thus degrade the performance of the student model. To alleviate the above feature misalignment issue, existing works mainly focus on spatially aligning the feature maps of the teacher and the student, with pixel-wise transformation. In this paper, we newly find that aligning the feature maps between teacher and student along the channel-wise dimension is also effective for addressing the feature misalignment issue. Specifically, we propose a learnable nonlinear channel-wise transformation to align the features of the student and the teacher model. Based on it, we further propose a simple and generic framework for feature distillation, with only one hyper-parameter to balance the distillation loss and the task specific loss. Extensive experimental results show that our method achieves significant performance improvements in various computer vision tasks including image classification (+3.28% top-1 accuracy for MobileNetV1 on ImageNet-1K), object detection (+3.9% bbox mAP for ResNet50-based Faster-RCNN on MS COCO), instance segmentation (+2.8% Mask mAP for ResNet50-based Mask-RCNN), and semantic segmentation (+4.66% mIoU for ResNet18-based PSPNet in semantic segmentation on Cityscapes), which demonstrates the effectiveness and the versatility of the proposed method. The code will be made publicly available.
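One way to read "learnable nonlinear channel-wise transformation" is a small stack of 1x1 convolutions acting across channels; the sketch below uses that reading (the paper's exact transformation may differ) and shows the single balancing hyper-parameter.

```python
import torch
import torch.nn as nn

class ChannelWiseDistiller(nn.Module):
    """Sketch of feature distillation with a learnable nonlinear channel-wise
    transformation on the student features (1x1 convolutions act only across
    the channel dimension at each spatial location)."""
    def __init__(self, student_c: int, teacher_c: int):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(student_c, teacher_c, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(teacher_c, teacher_c, kernel_size=1),
        )

    def forward(self, f_student, f_teacher):
        return nn.functional.mse_loss(self.transform(f_student), f_teacher)

distiller = ChannelWiseDistiller(student_c=256, teacher_c=512)
f_s, f_t = torch.randn(2, 256, 32, 32), torch.randn(2, 512, 32, 32)
task_loss = torch.tensor(1.0)      # placeholder for the task-specific loss
alpha = 2.0                        # the single hyper-parameter balancing the two terms
total_loss = task_loss + alpha * distiller(f_s, f_t)
```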
https://arxiv.org/abs/2303.13212
In this paper we investigate the frequency sensitivity of Deep Neural Networks (DNNs) when presented with clean samples versus poisoned samples. Our analysis shows significant disparities in frequency sensitivity between these two types of samples. Building on these findings, we propose FREAK, a frequency-based poisoned sample detection algorithm that is simple yet effective. Our experimental results demonstrate the efficacy of FREAK not only against frequency backdoor attacks but also against some spatial attacks. Our work is just the first step in leveraging these insights. We believe that our analysis and proposed defense mechanism will provide a foundation for future research and development of backdoor defenses.
https://arxiv.org/abs/2303.13211
Knee OsteoArthritis (KOA) is a prevalent musculoskeletal disorder that causes decreased mobility in seniors. The diagnosis provided by physicians is subjective, however, as it relies on personal experience and the semi-quantitative Kellgren-Lawrence (KL) scoring system. KOA has been successfully diagnosed by Computer-Aided Diagnostic (CAD) systems that use deep learning techniques like Convolutional Neural Networks (CNN). In this paper, we propose a novel Siamese-based network, and we introduce a new hybrid loss strategy for the early detection of KOA. The model extends the classical Siamese network by integrating a collection of Global Average Pooling (GAP) layers for feature extraction at each level. Then, to improve the classification performance, a novel training strategy that partitions each training batch into low-, medium- and high-confidence subsets, and a specific hybrid loss function are used for each new label attributed to each sample. The final loss function is then derived by combining the latter loss functions with optimized weights. Our test results demonstrate that our proposed approach significantly improves the detection performance.
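A schematic version of the confidence-partitioned training objective: the batch is split by the predicted-class probability into low-, medium- and high-confidence subsets, and per-subset losses are combined with weights. The thresholds, the per-subset label attribution, and the specific loss terms used in the paper are not reproduced.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits, labels, low_th=0.4, high_th=0.7, weights=(1.0, 1.0, 1.0)):
    """Split a batch into low-, medium- and high-confidence subsets using the
    softmax probability of the predicted class, then combine per-subset losses
    with weights. Illustrative only: thresholds and losses are placeholders."""
    probs = F.softmax(logits, dim=1)
    conf = probs.max(dim=1).values
    ce = F.cross_entropy(logits, labels, reduction="none")
    masks = [conf < low_th, (conf >= low_th) & (conf < high_th), conf >= high_th]
    total = logits.new_zeros(())
    for w, m in zip(weights, masks):
        if m.any():
            total = total + w * ce[m].mean()
    return total

loss = hybrid_loss(torch.randn(16, 5), torch.randint(0, 5, (16,)))
```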
https://arxiv.org/abs/2303.13203
Point cloud (PCD) anomaly detection steadily emerges as a promising research area. This study aims to improve PCD anomaly detection performance by combining handcrafted PCD descriptions with powerful pre-trained 2D neural networks. To this end, this study proposes Complementary Pseudo Multimodal Feature (CPMF) that incorporates local geometrical information in the 3D modality using handcrafted PCD descriptors and global semantic information in the generated pseudo 2D modality using pre-trained 2D neural networks. For global semantics extraction, CPMF projects the original PCD into a pseudo 2D modality containing multi-view images. These images are delivered to pre-trained 2D neural networks for informative 2D modality feature extraction. The 3D and 2D modality features are aggregated to obtain the CPMF for PCD anomaly detection. Extensive experiments demonstrate the complementary capacity between 2D and 3D modality features and the effectiveness of CPMF, with 95.15% image-level AU-ROC and 92.93% pixel-level PRO on the MVTec3D benchmark. Code is available on this https URL.
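A crude stand-in for the pseudo-2D projection is sketched below: the point cloud is rendered into depth maps from several azimuth views, which could then be fed to a pre-trained 2D backbone. The handcrafted 3D descriptors and the feature aggregation into CPMF are omitted.

```python
import numpy as np

def multiview_depth_images(points, n_views=6, size=224):
    """Render a point cloud into simple depth maps from several azimuth views,
    as a rough stand-in for CPMF's pseudo-2D modality."""
    pts = points - points.mean(axis=0)
    pts = pts / (np.linalg.norm(pts, axis=1).max() + 1e-8)   # fit in the unit ball
    images = []
    for k in range(n_views):
        a = 2 * np.pi * k / n_views
        rot = np.array([[np.cos(a), -np.sin(a), 0],
                        [np.sin(a),  np.cos(a), 0],
                        [0,          0,         1]])
        p = pts @ rot.T
        u = ((p[:, 0] + 1) / 2 * (size - 1)).astype(int)
        v = ((p[:, 2] + 1) / 2 * (size - 1)).astype(int)
        depth = np.zeros((size, size), dtype=np.float32)
        np.maximum.at(depth, (v, u), p[:, 1] + 1)   # keep one depth value per pixel (max)
        images.append(depth)
    return np.stack(images)                          # (n_views, size, size)

views = multiview_depth_images(np.random.randn(2048, 3))
```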
https://arxiv.org/abs/2303.13194
Out-of-distribution (OOD) detection is a common issue in deploying vision models in practice and solving it is an essential building block in safety critical applications. Existing OOD detection solutions focus on improving the OOD robustness of a classification model trained exclusively on in-distribution (ID) data. In this work, we take a different approach and propose to leverage generic pre-trained representations. We first investigate the behaviour of simple classifiers built on top of such representations and show striking performance gains compared to ID-trained representations. We propose a novel OOD method, called GROOD, that achieves excellent performance, predicated on the use of a good generic representation. Only a trivial training process is required for adapting GROOD to a particular problem. The method is simple, general, efficient, calibrated and has only a few hyper-parameters. The method achieves state-of-the-art performance on a number of OOD benchmarks, reaching near perfect performance on several of them. The source code is available at this https URL.
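In the spirit of "simple classifiers on generic pre-trained representations", a minimal OOD baseline can be written in a few lines: fit per-class means of frozen features and score a sample by its similarity to the nearest class mean. This is a generic sketch, not necessarily the exact GROOD procedure.

```python
import numpy as np

def fit_class_means(features, labels):
    """Per-class mean of frozen, L2-normalised pre-trained features; the only
    'training' needed for this kind of simple OOD detector."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    return np.stack([feats[labels == c].mean(axis=0) for c in np.unique(labels)])

def ood_score(feature, class_means):
    """Higher score = more likely out-of-distribution (far from every ID class)."""
    f = feature / np.linalg.norm(feature)
    return 1.0 - float((class_means @ f).max())      # 1 - max cosine similarity

id_feats = np.random.randn(1000, 768)                # frozen pre-trained embeddings
id_labels = np.random.randint(0, 10, 1000)
means = fit_class_means(id_feats, id_labels)
score = ood_score(np.random.randn(768), means)
```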
https://arxiv.org/abs/2303.13148
Recently, face swapping has been developing rapidly and achieved surprisingly realistic results, raising concerns about fake content. As a countermeasure, various detection approaches have been proposed and achieved promising performance. However, most existing detectors struggle to maintain performance on unseen face swapping methods and low-quality images. Apart from the generalization problem, current detection approaches have been shown to be vulnerable to evasion attacks crafted by detection-aware manipulators. This lack of robustness under adversarial scenarios threatens the application of face swapping detection in the real world. In this paper, we propose a novel face swapping detection approach based on face identification probability distributions, coined IdP_FSD, to improve generalization and robustness. IdP_FSD is specially designed for detecting swapped faces whose identities belong to a finite set, which is meaningful in real-world applications. Compared with previous general detection methods, we make use of the available real faces with the concerned identities and require no fake samples for training. IdP_FSD exploits face swapping's common nature that the identity of a swapped face combines those of the two faces involved in the swap. We reflect this nature with the confusion of a face identification model and measure the confusion with the maximum value of the output probability distribution. Furthermore, to defend our detector under adversarial scenarios, an attention-based finetuning scheme is proposed for the face identification models used in IdP_FSD. Extensive experiments show that the proposed IdP_FSD not only achieves high detection performance on different benchmark datasets and image qualities but also raises the bar for manipulators to evade detection.
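The confusion measure itself is straightforward to sketch: run the face-identification model over the finite identity set and use one minus the peak of the output probability distribution as the swap score (thresholding and the attention-based finetuning are not shown).

```python
import numpy as np

def swap_confusion_score(identity_logits):
    """IdP_FSD-style cue (sketch): a swapped face mixes two identities, so the
    face-identification model is confused and the peak of its probability
    distribution over the known, finite identity set is low."""
    z = identity_logits - identity_logits.max()      # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return 1.0 - float(probs.max())

genuine = np.array([9.0, 0.1, 0.2, 0.1])   # confident single identity
swapped = np.array([2.1, 1.9, 0.2, 0.1])   # mass split between two identities
assert swap_confusion_score(swapped) > swap_confusion_score(genuine)
```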
https://arxiv.org/abs/2303.13131
We address the challenge of training a large supernet for the object detection task, using a relatively small amount of training data. Specifically, we propose an efficient supernet-based neural architecture search (NAS) method that uses transfer learning and search space pruning. First, the supernet is pre-trained on a classification task, for which large datasets are available. Second, the search space defined by the supernet is pruned by removing candidate models that are predicted to perform poorly. To effectively remove the candidates over a wide range of resource constraints, we particularly design a performance predictor, called path filter, which can accurately predict the relative performance of the models that satisfy similar resource constraints. Hence, supernet training is more focused on the best-performing candidates. Our path filter handles prediction for paths with different resource budgets. Compared to once-for-all, our proposed method reduces the computational cost of the optimal network architecture by 30% and 63%, while yielding better accuracy-floating point operations Pareto front (0.85 and 0.45 points of improvement on average precision for Pascal VOC and COCO, respectively).
https://arxiv.org/abs/2303.13121
Cell detection is a fundamental task in computational pathology that can be used for extracting high-level medical information from whole-slide images. For accurate cell detection, pathologists often zoom out to understand the tissue-level structures and zoom in to classify cells based on their morphology and the surrounding context. However, there is a lack of efforts to reflect such behaviors by pathologists in the cell detection models, mainly due to the lack of datasets containing both cell and tissue annotations with overlapping regions. To overcome this limitation, we propose and publicly release OCELOT, a dataset purposely dedicated to the study of cell-tissue relationships for cell detection in histopathology. OCELOT provides overlapping cell and tissue annotations on images acquired from multiple organs. Within this setting, we also propose multi-task learning approaches that benefit from learning both cell and tissue tasks simultaneously. When compared against a model trained only for the cell detection task, our proposed approaches improve cell detection performance on 3 datasets: the proposed OCELOT, the public TIGER, and the internal CARP datasets. On the OCELOT test set in particular, we show up to a 6.79-point improvement in F1-score. We believe the contributions of this paper, including the release of the OCELOT dataset at this https URL, are a crucial starting point toward the important research direction of incorporating cell-tissue relationships in computational pathology.
https://arxiv.org/abs/2303.13110
To exploit the complementary information between heterogeneous data sources, we introduce a new Multimodal Transformer (MMFormer) for Remote Sensing (RS) image classification using Hyperspectral Image (HSI) accompanied by another source of data such as Light Detection and Ranging (LiDAR). Unlike the traditional Vision Transformer (ViT), which lacks the inductive biases of convolutions, we first introduce convolutional layers into our MMFormer to tokenize patches from the multimodal HSI and LiDAR data. Then we propose a Multi-scale Multi-head Self-Attention (MSMHSA) module to address the compatibility problem that often limits the fusion of HSI, with its high spectral resolution, and LiDAR, with its relatively low spatial resolution. The proposed MSMHSA module can incorporate HSI into LiDAR data in a coarse-to-fine manner, enabling us to learn a fine-grained representation. Extensive experiments on widely used benchmarks (e.g., Trento and MUUFL) demonstrate the effectiveness and superiority of our proposed MMFormer for RS image classification.
https://arxiv.org/abs/2303.13101
Active learning selects informative samples for annotation within budget, which has proven efficient recently on object detection. However, the widely used active detection benchmarks conduct image-level evaluation, which is unrealistic in human workload estimation and biased towards crowded images. Furthermore, existing methods still perform image-level annotation, but equally scoring all targets within the same image wastes budget and produces redundant labels. Having revealed the above problems and limitations, we introduce a box-level active detection framework that controls a box-based budget per cycle, prioritizes informative targets and avoids redundancy for fair comparison and efficient application. Under the proposed box-level setting, we devise a novel pipeline, namely Complementary Pseudo Active Strategy (ComPAS). It exploits both human annotations and the model intelligence in a complementary fashion: an efficient input-end committee queries labels for informative objects only; meanwhile, well-learned targets are identified by the model and compensated with pseudo-labels. ComPAS consistently outperforms 10 competitors under 4 settings in a unified codebase. With supervision from labeled data only, it achieves 100% supervised performance of VOC0712 with merely 19% box annotations. On the COCO dataset, it yields up to 4.3% mAP improvement over the second-best method. ComPAS also supports training with the unlabeled pool, where it surpasses 90% COCO supervised performance with 85% label reduction. Our source code is publicly available at this https URL.
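A bare-bones version of the box-level selection logic: spend the per-cycle box budget on the least confident candidate boxes and keep highly confident ones as pseudo-labels. The committee-based query scores and class-aware handling used by ComPAS are not reproduced.

```python
import numpy as np

def split_boxes(scores, box_budget, pseudo_th=0.9):
    """Box-level selection sketch: the lowest-confidence boxes are sent to
    annotators within the per-cycle budget, very confident ones become
    pseudo-labels; thresholds are illustrative placeholders."""
    order = np.argsort(scores)                 # ascending confidence
    query_idx = order[:box_budget]             # human annotation within budget
    pseudo_idx = np.where(scores >= pseudo_th)[0]
    pseudo_idx = np.setdiff1d(pseudo_idx, query_idx)
    return query_idx, pseudo_idx

conf = np.random.rand(500)                     # confidence of candidate boxes
to_label, to_pseudo = split_boxes(conf, box_budget=50)
```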
https://arxiv.org/abs/2303.13089
Open-vocabulary detection (OVD) is an object detection task aiming at detecting objects from novel categories beyond the base categories on which the detector is trained. Recent OVD methods rely on large-scale visual-language pre-trained models, such as CLIP, for recognizing novel objects. We identify the two core obstacles that need to be tackled when incorporating these models into detector training: (1) the distribution mismatch that happens when applying a VL-model trained on whole images to region recognition tasks; (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps learning generalizable object localization by a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel classes, which outperforms the previous SOTA by 2.4 AP50 even without resorting to extra training data. When extra training data is available, we train CORA$^+$ on both ground-truth base-category annotations and additional pseudo bounding box labels computed by CORA. CORA$^+$ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark.
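Region prompting can be sketched as adding learnable prompt features to the pooled region features before matching them against CLIP text embeddings; the sketch below follows that reading with arbitrary sizes, and anchor pre-matching is left out.

```python
import torch
import torch.nn as nn

class RegionPrompting(nn.Module):
    """Sketch of region prompting: learnable prompt features are added to the
    pooled region features before classification against CLIP text embeddings,
    to compensate for the whole-image-to-region distribution gap."""
    def __init__(self, dim: int = 512, pool: int = 7):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(dim, pool, pool))

    def forward(self, region_feats, text_embeds):
        # region_feats: (N, dim, pool, pool) pooled from the CLIP visual backbone
        # text_embeds:  (C, dim) CLIP text embeddings of the category names
        prompted = region_feats + self.prompt
        v = prompted.mean(dim=(2, 3))
        v = nn.functional.normalize(v, dim=-1)
        t = nn.functional.normalize(text_embeds, dim=-1)
        return v @ t.t()                      # (N, C) region-to-category similarity

logits = RegionPrompting()(torch.randn(8, 512, 7, 7), torch.randn(80, 512))
```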
https://arxiv.org/abs/2303.13076