Pseudo-labels are widely employed in weakly supervised 3D segmentation tasks where only sparse ground-truth labels are available for learning. Existing methods often rely on empirical label selection strategies, such as confidence thresholding, to generate beneficial pseudo-labels for model training. This approach may, however, hinder the comprehensive exploitation of unlabeled data points. We hypothesize that this selective usage arises from the noise in pseudo-labels generated on unlabeled data. The noise in pseudo-labels may result in significant discrepancies between pseudo-labels and model predictions, thus confusing and greatly affecting the model training. To address this issue, we propose a novel learning strategy to regularize the generated pseudo-labels and effectively narrow the gaps between pseudo-labels and model predictions. More specifically, our method introduces an Entropy Regularization loss and a Distribution Alignment loss for weakly supervised learning in 3D segmentation tasks, resulting in an ERDA learning strategy. Interestingly, when KL distance is used to formulate the distribution alignment loss, the objective reduces to a deceptively simple cross-entropy-based loss that optimizes both the pseudo-label generation network and the 3D segmentation network simultaneously. Despite its simplicity, our method yields promising performance improvements. We validate its effectiveness through extensive experiments on various baselines and large-scale datasets. Results show that ERDA enables the effective usage of all unlabeled data points for learning and achieves state-of-the-art performance under different settings. Remarkably, our method can outperform fully-supervised baselines using only 1% of true annotations. Code and model will be made publicly available.
https://arxiv.org/abs/2305.15832
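The reduction mentioned above follows from the identity H(p) + KL(p || q) = -sum_c p_c log q_c. Below is a minimal PyTorch sketch of the resulting loss on unlabeled points; the tensor shapes, function name, and the absence of any term weighting are our assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def erda_loss(pseudo_logits: torch.Tensor, pred_logits: torch.Tensor) -> torch.Tensor:
    """Entropy Regularization + Distribution Alignment on unlabeled points.

    With KL(p || q) as the alignment term, H(p) + KL(p || q) collapses to
    -sum_c p_c * log(q_c): a plain cross-entropy between the pseudo-label
    distribution p and the segmentation prediction q. Gradients flow into
    both the pseudo-label generator and the segmentation network.
    """
    p = F.softmax(pseudo_logits, dim=-1)        # (N, C) pseudo-label distribution
    log_q = F.log_softmax(pred_logits, dim=-1)  # (N, C) segmentation prediction
    return -(p * log_q).sum(dim=-1).mean()
```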
Weakly supervised vision-and-language pre-training (WVLP), which learns cross-modal representations with limited cross-modal supervision, has been shown to effectively reduce the data cost of pre-training while maintaining decent performance on downstream tasks. However, current WVLP methods use only local descriptions of images, i.e., object tags, as cross-modal anchors to construct weakly-aligned image-text pairs for pre-training. This affects the data quality and thus the effectiveness of pre-training. In this paper, we propose to directly take a small number of aligned image-text pairs as anchors, and represent each unaligned image and text by its similarities to these anchors, i.e., relative representations. We build a WVLP framework based on the relative representations, namely RELIT, which collects high-quality weakly-aligned image-text pairs from large-scale image-only and text-only data for pre-training through relative representation-based retrieval and generation. Experiments on four downstream tasks show that RELIT achieves new state-of-the-art results under the weakly supervised setting.
https://arxiv.org/abs/2305.15483
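A rough sketch of the relative-representation step described above, assuming precomputed embedding matrices for the unaligned items and for the small set of aligned anchor pairs (all names and shapes here are hypothetical):

```python
import numpy as np

def relative_representation(feats: np.ndarray, anchor_feats: np.ndarray) -> np.ndarray:
    """Represent each item by its cosine similarities to a shared set of anchors.

    feats:        (n_items, d)   embeddings of unaligned images or texts
    anchor_feats: (n_anchors, d) embeddings of the aligned anchor pairs
    returns:      (n_items, n_anchors) relative representations
    """
    feats = feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-12)
    anchors = anchor_feats / np.maximum(np.linalg.norm(anchor_feats, axis=1, keepdims=True), 1e-12)
    return feats @ anchors.T
```

Because images and texts are both expressed against the same anchors, weakly-aligned pairs can then be retrieved by comparing these vectors across modalities.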
Recent advances in weakly supervised text classification mostly focus on designing sophisticated methods to turn high-level human heuristics into quality pseudo-labels. In this paper, we revisit the seed matching-based method, which is arguably the simplest way to generate pseudo-labels, and show that its power was greatly underestimated. We show that the limited performance of seed matching is largely due to the label bias injected by the simple seed-match rule, which prevents the classifier from learning reliable confidence for selecting high-quality pseudo-labels. Interestingly, simply deleting the seed words present in the matched input texts can mitigate the label bias and help learn better confidence. Subsequently, the performance achieved by seed matching can be improved significantly, making it on par with or even better than the state-of-the-art. Furthermore, to handle the case when the seed words are not made known, we propose to simply delete the word tokens in the input text randomly with a high deletion ratio. Remarkably, seed matching equipped with this random deletion method can often achieve even better performance than that with seed deletion.
https://arxiv.org/abs/2305.14794
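A hedged sketch of seed matching with the two deletion variants described above; the `seed_words` mapping, tokenization, and deletion ratio are hypothetical placeholders:

```python
import random

def pseudo_label_with_deletion(text, seed_words, seeds_known=True, deletion_ratio=0.9):
    """Assign a pseudo-label by seed matching, then delete tokens to reduce label bias.

    seed_words: dict mapping each class name to its list of seed words.
    Returns (modified_text, label), or (None, None) if no unique class matches.
    """
    tokens = text.split()
    matched = {c for c, seeds in seed_words.items() if any(s in tokens for s in seeds)}
    if len(matched) != 1:
        return None, None             # ambiguous or no match: skip this text
    label = matched.pop()
    if seeds_known:                   # seed deletion: drop the matched seed words
        kept = [t for t in tokens if t not in seed_words[label]]
    else:                             # random deletion: drop tokens at a high ratio
        kept = [t for t in tokens if random.random() > deletion_ratio]
    return " ".join(kept), label
```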
Large self-supervised pre-trained speech models have achieved remarkable success across various speech-processing tasks. The self-supervised training of these models leads to universal speech representations that can be used for different downstream tasks, ranging from automatic speech recognition (ASR) to speaker identification. Recently, Whisper, a transformer-based model, was proposed and trained on a large amount of weakly supervised data for ASR; it outperformed several state-of-the-art self-supervised models. Given the superiority of Whisper for ASR, in this paper we explore the transferability of its representation to four other speech tasks in the SUPERB benchmark. Moreover, we explore the robustness of the Whisper representation for ``in the wild'' tasks where speech is corrupted by environment noise and room reverberation. Experimental results show Whisper achieves promising results across tasks and environmental conditions, thus showing potential for cross-task real-world deployment.
https://arxiv.org/abs/2305.14546
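As a rough illustration of reusing Whisper representations for other tasks, the sketch below extracts utterance-level encoder features with the Hugging Face transformers Whisper classes; the checkpoint choice and mean-pooling are our assumptions, not the paper's evaluation protocol:

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
encoder = WhisperModel.from_pretrained("openai/whisper-base").encoder.eval()

def whisper_features(waveform_16khz):
    """Return a fixed-size utterance embedding from Whisper's encoder."""
    inputs = extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(inputs.input_features).last_hidden_state  # (1, T, D)
    return hidden.mean(dim=1)  # pool over time for a downstream probe/classifier
```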
Since acquiring perfect supervision is usually difficult, real-world machine learning tasks often confront inaccurate, incomplete, or inexact supervision, collectively referred to as weak supervision. In this work, we present WSAUC, a unified framework for weakly supervised AUC optimization problems, which covers noisy label learning, positive-unlabeled learning, multi-instance learning, and semi-supervised learning scenarios. Within the WSAUC framework, we first frame the AUC optimization problems in various weakly supervised scenarios as a common formulation of minimizing the AUC risk on contaminated sets, and demonstrate that the empirical risk minimization problems are consistent with the true AUC. Then, we introduce a new type of partial AUC, specifically, the reversed partial AUC (rpAUC), which serves as a robust training objective for AUC maximization in the presence of contaminated labels. WSAUC offers a universal solution for AUC optimization in various weakly supervised scenarios by maximizing the empirical rpAUC. Theoretical and experimental results under multiple settings support the effectiveness of WSAUC on a range of weakly supervised AUC optimization tasks.
https://arxiv.org/abs/2305.14258
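For orientation, the sketch below is only the vanilla pairwise surrogate for the empirical AUC risk; WSAUC's reversed partial AUC (rpAUC) changes which positive/negative pairs are drawn from the contaminated sets and how the partial range is reversed, which this generic form does not reproduce:

```python
import torch

def pairwise_auc_risk(scores_pos: torch.Tensor, scores_neg: torch.Tensor) -> torch.Tensor:
    """Empirical AUC risk with a squared surrogate over all positive/negative pairs.

    Minimizing the mean of (1 - (s_pos - s_neg))^2 encourages every positive
    to be scored above every negative, i.e. AUC maximization.
    """
    diff = scores_pos[:, None] - scores_neg[None, :]  # (n_pos, n_neg) score gaps
    return ((1.0 - diff) ** 2).mean()
```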
Extremely Weakly Supervised Text Classification (XWS-TC) refers to text classification based on minimal high-level human guidance, such as a few label-indicative seed words or classification instructions. There are two mainstream approaches for XWS-TC that, however, have never been rigorously compared: (1) training classifiers based on pseudo-labels generated by (softly) matching seed words (SEED) and (2) prompting (and calibrating) language models using classification instructions (and raw texts) to decode label words (PROMPT). This paper presents the first XWS-TC benchmark to compare the two approaches on fair grounds, where the datasets, supervision, and hyperparameter choices are standardized across methods. Our benchmarking results suggest that (1) both SEED and PROMPT approaches are competitive and there is no clear winner; (2) SEED is empirically more tolerant than PROMPT to changes in human guidance (e.g., seed words, classification instructions, and label words); (3) SEED is empirically more selective than PROMPT about the pre-trained language model; (4) recent SEED and PROMPT methods have close connections, and a clustering post-processing step based on raw in-domain texts is a strong performance booster for both. We hope this benchmark serves as a guideline for selecting XWS-TC methods in different scenarios and stimulates interest in developing guidance- and model-robust XWS-TC methods. We release the repo at this https URL.
"Etremely Weakly Supervised Text Classification (XWS-TC)"指的是基于 minimal high-level human guidance,例如一些带有标签指示的种子词或分类指令的文字分类方法。然而,对于 XWS-TC 方法,从未进行严格的比较:(1) 基于(softly)匹配种子词生成伪标签的训练分类器(seed),(2) 使用分类指令(和原始文本)引导语言模型解码标签单词(PROMPT)。本文提出了第一个 XWS-TC 基准,旨在通过公正的理由比较两种方法,在该方法间进行数据集、监督器和超参数选择的统一。我们的基准测试结果显示,(1) 两个种子方法都是竞争力强的,没有明确的胜者;(2) 种子方法 empirical 上比 PROMPT 更加容忍人类指导的变化(例如种子词、分类指令和标签单词);(3) 种子方法 empirical 上比预训练语言模型更加选择性;(4) 最近的两个种子和 PROMPT 方法有密切的联系,基于原始领域文本的聚类后处理步骤是一个增强 both 方法性能的强大性能Booster。我们希望这个基准可以作为在不同情况下选择 XWS-TC 方法的指导方针,并刺激开发 guidance 和模型 robust XWS-TC 方法的兴趣。我们将在这个 https URL 上发布代码仓库。
https://arxiv.org/abs/2305.12749
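A hedged sketch of the PROMPT approach above, using a masked language model to decode label words under a classification instruction; the template, label words, and checkpoint are hypothetical, and the calibration step is omitted:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()
label_words = {"sports": " sports", "politics": " politics"}  # hypothetical classes

def prompt_classify(text: str) -> str:
    """Score each class's label word at the masked position of an instruction template."""
    enc = tok(f"{text} This article is about{tok.mask_token}.",
              return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = mlm(**enc).logits
    pos = (enc.input_ids == tok.mask_token_id).nonzero()[0, 1]
    scores = {c: logits[0, pos, tok.encode(w, add_special_tokens=False)[0]].item()
              for c, w in label_words.items()}
    return max(scores, key=scores.get)
```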
To mitigate the necessity for large amounts of supervised segmentation annotation sets, multiple Weakly Supervised Semantic Segmentation (WSSS) strategies have been devised. These often rely on advanced data and model regularization strategies to instigate the development of useful properties (e.g., prediction completeness and fidelity to semantic boundaries) in segmentation priors, notwithstanding the lack of annotated information. In this work, we first create a strong baseline by analyzing complementary WSSS techniques and regularizing strategies, considering their strengths and limitations. We then propose a new Class-specific Adversarial Erasing strategy, comprising two adversarial CAM-generating networks that are gradually refined to produce robust semantic segmentation proposals. Empirical results suggest that our approach induces substantial improvement in the effectiveness of the baseline, resulting in a noticeable improvement on both the Pascal VOC 2012 and MS COCO 2014 datasets.
https://arxiv.org/abs/2305.12522
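A minimal sketch of the class-specific adversarial erasing idea above: the first network's most discriminative region is erased from the input before the second network produces a complementary CAM; both networks are assumed to output per-class activation maps, and the gradual refinement schedule is omitted:

```python
import torch

def adversarial_erasing_cam(net_a, net_b, image, cls_idx, thresh=0.6):
    """Combine a CAM with a complementary CAM computed on the erased image.

    image: (B, 3, H, W); net_a/net_b are assumed to return (B, C, H, W) CAMs.
    """
    cam_a = net_a(image)[:, cls_idx]                          # (B, H, W)
    cam_a = cam_a / cam_a.amax(dim=(1, 2), keepdim=True).clamp(min=1e-6)
    keep = (cam_a < thresh).float().unsqueeze(1)              # erase high activations
    cam_b = net_b(image * keep)[:, cls_idx]                   # complementary CAM
    return torch.maximum(cam_a, cam_b)                        # union covers more object
```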
State-of-the-art weakly supervised text classification methods, while significantly reducing the required human supervision, still require the supervision to cover all the classes of interest. This is rarely easy to meet in practice when humans explore new, large corpora without a complete picture. In this paper, we work on a novel yet important problem of weakly supervised open-world text classification, where supervision is only needed for a few examples from a few known classes and the machine should handle both known and unknown classes at test time. General open-world classification has been studied mostly using image classification; however, existing methods typically assume the availability of sufficient known-class supervision and strong unknown-class prior knowledge (e.g., the number of classes and/or the data distribution). We propose a novel framework, WOT-Class, that lifts those strong assumptions. Specifically, it follows an iterative process of (a) clustering text into new classes, (b) mining and ranking indicative words for each class, and (c) merging redundant classes by using the overlapping indicative words as a bridge. Extensive experiments on 7 popular text classification datasets demonstrate that WOT-Class consistently outperforms strong baselines by a large margin, attaining 23.33% greater average absolute macro-F1 over existing approaches across all datasets. Such competent accuracy illuminates the practical potential of further reducing human effort for text classification.
https://arxiv.org/abs/2305.12401
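A rough sketch of steps (a) and (b) above: cluster text embeddings into candidate classes, then mine indicative words per cluster via mean TF-IDF; step (c) would merge clusters whose indicative-word lists overlap. All parameters are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_and_mine(texts, embeddings, n_clusters=20, top_k=10):
    """Return cluster ids and the top indicative words for each candidate class."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(texts)
    vocab = np.array(vec.get_feature_names_out())
    indicative = {}
    for c in range(n_clusters):
        mean_tfidf = np.asarray(tfidf[labels == c].mean(axis=0)).ravel()
        indicative[c] = vocab[np.argsort(-mean_tfidf)[:top_k]].tolist()
    return labels, indicative
```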
We empirically investigate proper pre-training methods for building good visual tokenizers, making Large Language Models (LLMs) powerful Multimodal Large Language Models (MLLMs). In our benchmark, which is curated to evaluate MLLMs' visual semantic understanding and fine-grained perception capabilities, we discuss different visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, DINO) and observe that: i) fully/weakly supervised models capture more semantics than self-supervised models, but the gap is narrowed by scaling up the pre-training dataset; ii) self-supervised models are better at fine-grained perception, where patch-level supervision is particularly effective; iii) tuning the visual tokenizer leads to the loss of semantics obtained from large-scale pre-training, which is unfavorable with a relatively small-scale instruction-tuning dataset. Given these findings, we review methods that attempted to unify semantics and fine-grained visual understanding, e.g., patch-level feature distillation with semantically-rich targets. We obtain an intriguing insight: mask-based strategies that were once all the rage may not be applicable for obtaining good visual tokenizers. Based on this critical observation, we obtain a new MLLM equipped with a tailored Good Visual Tokenizer (GVT), which exhibits strong visual comprehension capability at multiple scales. In particular, without introducing extra parameters and task-specific fine-tuning, GVT achieves superior performance on visual question answering, image captioning, and other fine-grained visual understanding tasks such as object counting and multi-class identification.
https://arxiv.org/abs/2305.12223
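As a small illustration of the patch-level feature distillation with semantically-rich targets mentioned above (e.g. CLIP patch features as the teacher), a negative-cosine alignment sketch under assumed shapes:

```python
import torch.nn.functional as F

def patch_distill_loss(student_patches, teacher_patches):
    """Align each student patch feature with a semantically-rich teacher target.

    Both inputs: (B, N_patches, D). The teacher is frozen (detached).
    """
    s = F.normalize(student_patches, dim=-1)
    t = F.normalize(teacher_patches, dim=-1).detach()
    return (1 - (s * t).sum(dim=-1)).mean()  # mean negative cosine similarity
```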
We introduce a new Information Extraction (IE) task dubbed Instruction-based IE, which aims to ask the system to follow specific instructions or guidelines to extract information. To facilitate research in this area, we construct a dataset called InstructIE, consisting of 270,000 weakly supervised instances from Chinese Wikipedia and 1,000 high-quality crowdsourced annotated instances. We further evaluate the performance of various baseline models on the InstructIE dataset. The results reveal that although current models exhibit promising performance, there is still room for improvement. Furthermore, we conduct a comprehensive case study analysis, underlining the challenges inherent in the Instruction-based IE task. Code and dataset are available at this https URL.
https://arxiv.org/abs/2305.11527
Visual Grounding (VG) refers to locating a region described by expressions in a specific image, which is a critical topic in vision-language fields. To alleviate the dependence on labeled data, existing unsupervised methods try to locate regions using task-unrelated pseudo-labels. However, a large proportion of the pseudo-labels are noisy, and their language taxonomy lacks diversity. Inspired by the advances in V-L pretraining, we consider utilizing VLP models to realize unsupervised transfer learning in the downstream grounding task. Thus, we propose CLIP-VG, a novel method that conducts self-paced curriculum adapting of CLIP by exploiting pseudo-language labels to solve the VG problem. By designing an efficient model structure, we first propose single-source and multi-source curriculum adapting methods for unsupervised VG that progressively sample more reliable cross-modal pseudo-labels to obtain the optimal model, thus achieving implicit knowledge exploitation and denoising. Our method outperforms the existing state-of-the-art unsupervised VG method Pseudo-Q in both single-source and multi-source scenarios by a large margin, i.e., 6.78%~10.67% and 11.39%~24.87% on the RefCOCO/+/g datasets, and even outperforms existing weakly supervised methods. The code and models will be released at \url{this https URL}.
https://arxiv.org/abs/2305.08685
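A hedged sketch of the self-paced curriculum selection above: training starts from the most reliable pseudo-labelled pairs and admits more each round; the reliability measure and schedule are our assumptions, and the single-/multi-source design is not reproduced:

```python
def curriculum_select(pseudo_pairs, reliability, round_idx, base=0.9, step=0.1):
    """Keep pseudo-labelled image-text pairs whose reliability (e.g. a CLIP
    image-text similarity) exceeds a threshold that relaxes each round."""
    thresh = max(base - step * round_idx, 0.0)
    return [pair for pair in pseudo_pairs if reliability(pair) >= thresh]
```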
This study introduces an efficacious approach, Masked Collaborative Contrast (MCC), to emphasize semantic regions in weakly supervised semantic segmentation. MCC adroitly incorporates concepts from masked image modeling and contrastive learning to devise Transformer blocks that induce keys to contract towards semantically pertinent regions. Unlike prevalent techniques that directly eradicate patch regions in the input image when generating masks, we scrutinize the neighborhood relations of patch tokens by exploring masks considering keys on the affinity matrix. Moreover, we generate positive and negative samples in contrastive learning by utilizing the masked local output and contrasting it with the global output. Elaborate experiments on commonly employed datasets evidence that the proposed MCC mechanism effectively aligns global and local perspectives within the image, attaining impressive performance.
https://arxiv.org/abs/2305.08491
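A minimal sketch of contrasting the masked local output with the global output, phrased as an InfoNCE objective over a batch; shapes, the temperature, and the use of in-batch negatives are assumptions:

```python
import torch
import torch.nn.functional as F

def local_global_contrast(local_feats, global_feats, tau=0.07):
    """Pull each image's masked local output toward its own global output
    (positive) and away from other images' global outputs (negatives)."""
    l = F.normalize(local_feats, dim=-1)   # (B, D)
    g = F.normalize(global_feats, dim=-1)  # (B, D)
    logits = l @ g.t() / tau               # (B, B) similarity matrix
    targets = torch.arange(l.size(0), device=l.device)
    return F.cross_entropy(logits, targets)
```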
ROI extraction is an active but challenging task in remote sensing because of complicated landforms, complex boundaries, and the requirement for annotations. Weakly supervised learning (WSL) aims at learning a mapping from input images to pixel-wise predictions under image-wise labels, which can dramatically decrease the labor cost. However, due to the imprecision of labels, the accuracy and time consumption of WSL methods are relatively unsatisfactory. In this paper, we propose a two-step ROI extraction based on contrastive learning. Firstly, we integrate multiscale Grad-CAM to obtain pseudo pixel-wise annotations with well-delineated boundaries. Then, to reduce the impact of misjudgments in the pseudo annotations, we construct a contrastive learning strategy to encourage the features inside the ROI to be as close as possible and to separate background features from foreground features. Comprehensive experiments demonstrate the superiority of our proposal. Code is available at this https URL
https://arxiv.org/abs/2305.05887
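A rough sketch of the first step above, fusing Grad-CAMs computed at several input scales into a single pseudo annotation; `cam_fn` is a hypothetical callable wrapping Grad-CAM for the target class:

```python
import torch
import torch.nn.functional as F

def fused_multiscale_cam(cam_fn, image, scales=(0.5, 1.0, 1.5)):
    """Average class activation maps over input scales, resized to a common size."""
    h, w = image.shape[-2:]
    cams = []
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        cam = cam_fn(scaled)  # (B, 1, h', w') Grad-CAM at this scale
        cams.append(F.interpolate(cam, size=(h, w), mode="bilinear",
                                  align_corners=False))
    fused = torch.stack(cams).mean(dim=0)
    return (fused - fused.amin()) / (fused.amax() - fused.amin() + 1e-6)
```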
Weakly supervised semantic segmentation (WSSS) based on image-level labels is challenging since it is hard to obtain complete semantic regions. To address this issue, we propose a self-training method that utilizes fused multi-scale class-aware attention maps. Our observation is that attention maps of different scales contain rich complementary information, especially for large and small objects. Therefore, we collect information from attention maps of different scales and obtain multi-scale attention maps. We then apply denoising and reactivation strategies to enhance the potential regions and reduce noisy areas. Finally, we use the refined attention maps to retrain the network. Experiments show that our method enables the model to extract rich semantic information from multi-scale images and achieves 72.4% mIoU on both the PASCAL VOC 2012 validation and test sets. The code is available at this https URL.
https://arxiv.org/abs/2305.05841
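A minimal sketch of fusing per-scale class-aware attention maps and then applying the denoising and reactivation steps; the element-wise fusion rule and the two thresholds are our assumptions rather than the paper's exact strategy:

```python
import torch

def refine_attention(maps, low=0.2, high=0.7):
    """maps: list of (B, H, W) attention maps from different input scales."""
    fused = torch.stack(maps).max(dim=0).values           # complementary fusion
    fused = torch.where(fused < low, torch.zeros_like(fused), fused)   # denoise
    fused = torch.where(fused > high, torch.ones_like(fused), fused)   # reactivate
    return fused   # refined maps used to retrain the network
```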
Weakly Supervised Semantic Segmentation (WSSS) with only image-level supervision has garnered increasing attention due to its low annotation cost compared to pixel-level annotation. Most existing methods rely on Class Activation Maps (CAM) to generate pixel-level pseudo labels for supervised training. However, it is well known that CAM often suffers from partial activation -- activating the most discriminative part instead of the entire object area -- and false activation -- unnecessarily activating the background around the object. In this study, we introduce a simple yet effective approach to address these limitations by harnessing the recently released Segment Anything Model (SAM) to generate higher-quality pseudo labels with CAM. SAM is a segmentation foundation model that demonstrates strong zero-shot ability in partitioning images into segments but lacks semantic labels for these regions. To circumvent this, we employ the pseudo labels for a specific class as the signal to select the most relevant masks and label them to generate the refined pseudo labels for this class. The segments generated by SAM are highly precise, leading to substantial improvements in partial and false activation. Moreover, existing post-processing modules for producing pseudo labels, such as AffinityNet, are often computationally heavy, with a significantly long training time. Surprisingly, we discovered that using the initial CAM with SAM can achieve performance on par with the post-processed pseudo labels generated by these modules, at much less computational cost. Our approach is highly versatile and capable of seamless integration into existing WSSS models without modification to base networks or pipelines. Despite its simplicity, our approach improves the mean Intersection over Union (mIoU) of pseudo labels from five state-of-the-art WSSS methods by 6.2\% on average on the PASCAL VOC 2012 dataset.
https://arxiv.org/abs/2305.05803
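A hedged sketch of the mask-selection step above: SAM segments whose area mostly falls inside a class's CAM pseudo-label are relabeled to that class; the overlap criterion is an assumption:

```python
import numpy as np

def refine_with_sam(cam_pseudo, sam_masks, cls_id, overlap=0.5):
    """cam_pseudo: (H, W) int label map from CAM; sam_masks: list of boolean
    (H, W) segments from SAM. Returns a refined label map for one class."""
    cls_region = cam_pseudo == cls_id
    refined = np.zeros_like(cam_pseudo)
    for m in sam_masks:
        if (m & cls_region).sum() / max(m.sum(), 1) >= overlap:
            refined[m] = cls_id   # SAM's precise boundary replaces the coarse CAM
    return refined
```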
Representing text in a multidimensional space can be done with sentence embedding models such as Sentence-BERT (SBERT). However, training these models when the data has a complex multilevel structure requires individually trained class-specific models, which increases time and computing costs. We propose a two-step approach which enables us to map sentences according to their hierarchical memberships and polarity. At first we teach the upper-level sentence space through an AdaCos loss function, and then we fine-tune with a novel loss function mainly based on the cosine similarity of intra-level pairs. We apply this method to three different datasets: two weakly supervised Big Five personality datasets obtained from English and Japanese Twitter data and the benchmark MNLI dataset. We show that our single-model approach performs better than multiple class-specific classification models.
https://arxiv.org/abs/2305.05748
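A small sketch of the second-stage objective described above, built on the cosine similarity of intra-level pairs; the AdaCos first stage is not shown, and the exact pairing scheme is an assumption:

```python
import torch.nn.functional as F

def intra_level_cosine_loss(emb_a, emb_b, same_level):
    """emb_a, emb_b: (B, D) sentence embeddings; same_level: (B,) float in {0, 1}.
    Pull intra-level pairs together; push cross-level pairs toward zero similarity."""
    sim = F.cosine_similarity(emb_a, emb_b, dim=-1)
    return ((1 - sim) * same_level + sim.clamp(min=0) * (1 - same_level)).mean()
```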
In a constantly evolving world, change detection is of prime importance to keep maps up to date. To better sense areas with complex geometry (urban areas in particular), considering 3D data appears to be an interesting alternative to classical 2D images. In this context, 3D point clouds (PCs) obtained by LiDAR or photogrammetry are very interesting. While recent studies have shown the considerable benefit of using deep learning-based methods to detect and characterize changes in raw 3D PCs, these studies rely on large annotated training data to obtain accurate results. The collection of these annotations is tricky and time-consuming. The availability of unsupervised or weakly supervised approaches is therefore of prime interest. In this paper, we propose an unsupervised method, called DeepCluster 3D Change Detection (DC3DCD), to detect and categorize multiclass changes at point level. We classify our approach in the unsupervised family given that we extract, in a completely unsupervised way, a number of clusters associated with potential changes. Note that at the end of the process, the user only has to assign a label to each of these clusters to derive the final change map. Our method builds upon the DeepCluster approach, originally designed for image classification, to handle complex raw 3D PCs and perform the change segmentation task. An assessment of the method on both simulated and real public datasets is provided. The proposed method outperforms a fully-supervised traditional machine learning algorithm and is competitive with fully-supervised deep learning networks applied to rasterizations of 3D PCs, with a mean IoU over change classes of 57.06% and 66.69% for the simulated and real datasets, respectively.
https://arxiv.org/abs/2305.05421
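A minimal sketch of one DeepCluster-style iteration over per-point deep features; as noted above, the user only assigns a change class to each final cluster. The cluster count and the feature extractor are assumptions:

```python
from sklearn.cluster import KMeans

def deepcluster_step(point_features, n_clusters=50):
    """k-means over (n_points, d) deep features yields pseudo-labels that
    supervise the next training epoch; repeating this loop refines both the
    features and the clusters without any annotation."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(point_features)
```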
Cancer diagnoses typically involve human pathologists examining whole slide images (WSIs) of tissue section biopsies to identify tumor cells and their subtypes. However, artificial intelligence (AI)-based models, particularly weakly supervised approaches, have recently emerged as viable alternatives. Weakly supervised approaches often use image subsections or tiles as input, with the overall classification of the WSI based on attention scores assigned to each tile. However, this method overlooks the potential for false positives/negatives because tumors can be heterogeneous, with cancer and normal cells growing in patterns larger than a single tile. Such errors at the tile level could lead to misclassification at the tumor level. To address this limitation, we developed a novel deep learning pooling operator called CHARM (Contrastive Histopathology Attention Resolved Models). CHARM leverages the dependencies among single tiles within a WSI and imposes contextual constraints as prior knowledge on multiple instance learning models. We tested CHARM on the subtyping of non-small cell lung cancer (NSCLC) and lymph node (LN) metastasis, and the results demonstrated its superiority over other state-of-the-art weakly supervised classification algorithms. Furthermore, CHARM facilitates interpretability by visualizing regions of attention.
https://arxiv.org/abs/2305.05314
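For context, a generic attention-based MIL pooling operator over WSI tiles, the family of operators CHARM builds on; its contrastive contextual constraints among tiles are not reproduced here:

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Pool tile embeddings into one slide embedding via learned attention."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))

    def forward(self, tiles):                            # tiles: (n_tiles, dim)
        attn = torch.softmax(self.score(tiles), dim=0)   # (n_tiles, 1) weights
        return (attn * tiles).sum(dim=0), attn           # slide embedding, weights
```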
Weakly supervised semantic segmentation (WSSS) models relying on class activation maps (CAMs) have achieved desirable performance compared to non-CAM-based counterparts. However, to keep the WSSS task feasible, we need to generate pseudo labels by expanding the seeds from CAMs, which is complex and time-consuming, thus hindering the design of efficient end-to-end (single-stage) WSSS approaches. To tackle the above dilemma, we resort to off-the-shelf and readily accessible saliency maps for directly obtaining pseudo labels given the image-level class labels. Nevertheless, the salient regions may contain noisy labels and cannot seamlessly fit the target objects, and saliency maps can only be approximated as pseudo labels for simple images containing single-class objects. As such, a segmentation model trained on these simple images cannot generalize well to complex images containing multi-class objects. To this end, we propose an end-to-end multi-granularity denoising and bidirectional alignment (MDBA) model to alleviate the noisy-label and multi-class generalization issues. Specifically, we propose online noise filtering and progressive noise detection modules to tackle image-level and pixel-level noise, respectively. Moreover, a bidirectional alignment mechanism is proposed to reduce the data distribution gap at both input and output space with simple-to-complex image synthesis and complex-to-simple adversarial learning. MDBA reaches an mIoU of 69.5\% and 70.2\% on the validation and test sets of the PASCAL VOC 2012 dataset. The source codes and models have been made available at \url{this https URL}.
https://arxiv.org/abs/2305.05154
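A hedged sketch of pixel-level online noise filtering: keep only the pixels where the saliency-derived pseudo-label is among the model's most confident predictions; the keep ratio and the confidence criterion are assumptions:

```python
import torch

def online_noise_filter(pred_logits, pseudo_label, keep_ratio=0.7):
    """pred_logits: (B, C, H, W); pseudo_label: (B, H, W) long tensor.
    Returns a boolean mask marking the pixels retained in the training loss."""
    probs = pred_logits.softmax(dim=1)
    conf = probs.gather(1, pseudo_label.unsqueeze(1)).squeeze(1)  # label confidence
    k = max(int(conf.numel() * keep_ratio), 1)
    thresh = conf.flatten().kthvalue(conf.numel() - k + 1).values
    return conf >= thresh
```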
Explainable artificial intelligence (AI) techniques are increasingly being explored to provide insights into why AI and machine learning (ML) models produce a certain outcome in various applications. However, there has been limited exploration of explainable AI techniques on time-series data, especially in the healthcare context. In this paper, we describe a threshold-based method that utilizes a weakly supervised model and a gradient-based explainable AI technique (i.e. a saliency map) and explore its feasibility for identifying salient frames of time-series data. Using a dataset from 15 post-stroke survivors performing three upper-limb exercises, with labels on whether a compensatory motion is observed or not, we implemented a feed-forward neural network model and utilized gradients of each input on model outcomes to identify salient frames that involve compensatory motions. According to the evaluation using frame-level annotations, our approach achieved a recall of 0.96 and an F2-score of 0.91. Our results demonstrate the potential of a gradient-based explainable AI technique (e.g. a saliency map) for time-series data, such as highlighting the frames of a video that therapists should focus on reviewing and reducing the effort of frame-level labeling for model training.
https://arxiv.org/abs/2305.05525
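A minimal sketch of the gradient-based step above: backpropagate the compensation score to the input time series, aggregate per-frame gradient magnitudes, and threshold to flag salient frames; the model interface and threshold rule are assumptions:

```python
import torch

def salient_frames(model, x, thresh_ratio=0.5):
    """x: (T, F) time series of joint features; model maps (1, T, F) to a scalar
    compensation score. Returns a boolean mask over the T frames."""
    x = x.clone().requires_grad_(True)
    score = model(x.unsqueeze(0)).squeeze()
    score.backward()                          # saliency map of the input
    frame_saliency = x.grad.abs().sum(dim=1)  # (T,) per-frame magnitude
    return frame_saliency >= thresh_ratio * frame_saliency.max()
```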