Weakly supervised medical image segmentation (MIS) using generative models is crucial for clinical diagnosis. However, the accuracy of the segmentation results is often limited by insufficient supervision and the complex nature of medical imaging. Existing models also provide only a single output, which does not allow uncertainty to be measured. In this paper, we introduce DiffSeg, a segmentation model for skin lesions based on diffusion difference, which exploits diffusion-model principles to extract noise-based features from images with diverse semantic information. By discerning differences between these noise features, the model identifies diseased areas. Moreover, its multi-output capability mimics doctors' annotation behavior, facilitating the visualization of segmentation consistency and ambiguity. Additionally, it quantifies output uncertainty using the Generalized Energy Distance (GED), aiding interpretability and decision-making for physicians. Finally, the model integrates its outputs through the Dense Conditional Random Field (DenseCRF) algorithm, which refines the segmentation boundaries by considering inter-pixel correlations, improving accuracy and optimizing the segmentation results. We demonstrate the effectiveness of DiffSeg on the ISIC 2018 Challenge dataset, outperforming state-of-the-art U-Net-based methods.
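Since the entry leans on GED for uncertainty quantification, a minimal sketch of the metric may help. This assumes the common segmentation choice d(x, y) = 1 - IoU(x, y) and treats the sampled masks as given; the model itself is not shown.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty masks agree perfectly
    return np.logical_and(a, b).sum() / union

def ged_squared(model_samples, annotator_masks) -> float:
    # GED^2 = 2 E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')], estimated over
    # all ordered pairs of samples; lower means the two mask
    # distributions are closer.
    d = lambda x, y: 1.0 - iou(x, y)
    cross = np.mean([d(s, y) for s in model_samples for y in annotator_masks])
    within_s = np.mean([d(s, t) for s in model_samples for t in model_samples])
    within_y = np.mean([d(y, z) for y in annotator_masks for z in annotator_masks])
    return 2 * cross - within_s - within_y
```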
https://arxiv.org/abs/2404.16474
Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in the contrastive loss between image and text pairs poses computational challenges. This paper presents a novel method for weakly supervised pre-training of vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in the contrastive loss, achieving a remarkable $2.7\times$ acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality. Our source code along with pre-trained model weights and training recipes is available at \url{this https URL}.
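A hedged sketch of the central reframing, assuming caption-derived concepts have already been mapped to a fixed vocabulary (the mining step is not shown): pre-training becomes multi-label classification, so the B x B similarity matrix of contrastive learning disappears.

```python
import torch
import torch.nn as nn

class ClassificationPretrainHead(nn.Module):
    """Multi-label classification over a caption-derived vocabulary."""
    def __init__(self, embed_dim: int, vocab_size: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, vocab_size)

    def forward(self, image_features: torch.Tensor,
                target_multi_hot: torch.Tensor) -> torch.Tensor:
        # image_features: (B, D); target_multi_hot: (B, V) floats with 1s
        # where a vocabulary concept appears in the paired caption.
        logits = self.classifier(image_features)
        # Per-concept BCE: cost O(B * V), no pairwise image-text matrix.
        return nn.functional.binary_cross_entropy_with_logits(
            logits, target_multi_hot)
```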
https://arxiv.org/abs/2404.15653
Weakly supervised segmentation methods have gained significant attention due to their ability to reduce the reliance on costly pixel-level annotations during model training. However, the current weakly supervised nuclei segmentation approaches typically follow a two-stage pseudo-label generation and network training process. The performance of the nuclei segmentation heavily relies on the quality of the generated pseudo-labels, thereby limiting its effectiveness. This paper introduces a novel domain-adaptive weakly supervised nuclei segmentation framework using cross-task interaction strategies to overcome the challenge of pseudo-label generation. Specifically, we utilize weakly annotated data to train an auxiliary detection task, which assists the domain adaptation of the segmentation network. To enhance the efficiency of domain adaptation, we design a consistent feature constraint module integrating prior knowledge from the source domain. Furthermore, we develop pseudo-label optimization and interactive training methods to improve the domain transfer capability. To validate the effectiveness of our proposed method, we conduct extensive comparative and ablation experiments on six datasets. The results demonstrate the superiority of our approach over existing weakly supervised approaches. Remarkably, our method achieves comparable or even better performance than fully supervised methods. Our code will be released in this https URL.
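The "consistent feature constraint" is described only at a high level; as a generic illustration (not the paper's module), a consistency term pulling the segmentation network's features toward those of the auxiliary detection task could look like the following.

```python
import torch
import torch.nn.functional as F

def feature_consistency_loss(seg_feats: torch.Tensor,
                             det_feats: torch.Tensor) -> torch.Tensor:
    # seg_feats, det_feats: (B, C, H, W) features of the same images from
    # the segmentation network and the auxiliary detection network.
    return 1.0 - F.cosine_similarity(seg_feats.flatten(1),
                                     det_feats.flatten(1), dim=1).mean()
```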
https://arxiv.org/abs/2404.14956
Current point cloud semantic segmentation has achieved great advances when given sufficient labels. However, the dense annotation of LiDAR point clouds remains prohibitively expensive and time-consuming, unable to keep up with the continuously growing volume of data. In this paper, we propose annotating images with scattered points, followed by utilizing SAM (a foundation model) to generate semantic segmentation labels for the images. Finally, by mapping the segmentation labels of the images to the LiDAR space using the intrinsic and extrinsic parameters of the camera and LiDAR, we obtain labels for point cloud semantic segmentation, and release Scatter-KITTI and Scatter-nuScenes, the first works to utilize SAM-based image segmentation for weakly supervised point cloud semantic segmentation. Furthermore, to mitigate the influence of erroneous pseudo labels obtained from sparse annotations on point cloud features, we propose a multi-modal weakly supervised network for LiDAR semantic segmentation, called MM-ScatterNet. This network combines features from both point cloud and image modalities, enhancing the representation learning of point clouds by introducing consistency constraints between multi-modal features and point cloud features. On the SemanticKITTI dataset, we achieve 66% of fully supervised performance using only 0.02% of the annotated data, and on the nuScenes dataset, we achieve 95% of fully supervised performance using only 0.1% of labeled points.
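The lifting step, mapping per-pixel labels onto LiDAR points with the camera intrinsics and extrinsics, is mechanical enough to sketch. This is a generic pinhole projection under assumed conventions (a 3x3 intrinsic matrix K whose third row is [0, 0, 1], and a 4x4 LiDAR-to-camera transform), not the released code.

```python
import numpy as np

def lift_image_labels_to_points(points, seg_labels, K, T_cam_from_lidar,
                                ignore_index=255):
    # points: (N, 3) LiDAR xyz; seg_labels: (H, W) integer label map from
    # SAM; K: (3, 3) intrinsics; T_cam_from_lidar: (4, 4) extrinsics.
    H, W = seg_labels.shape
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]       # LiDAR -> camera frame
    z = cam[:, 2]
    uvw = (K @ cam.T).T                               # pinhole projection
    safe_z = np.where(np.abs(z) > 1e-6, z, 1e-6)      # avoid divide-by-zero
    u = np.round(uvw[:, 0] / safe_z).astype(int)
    v = np.round(uvw[:, 1] / safe_z).astype(int)
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels = np.full(len(points), ignore_index, dtype=np.int64)
    labels[valid] = seg_labels[v[valid], u[valid]]
    return labels                                     # per-point pseudo labels
```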
https://arxiv.org/abs/2404.12861
Deep learning is dramatically transforming the field of medical imaging and radiology, enabling the identification of pathologies in medical images, including computed tomography (CT) and X-ray scans. However, the performance of deep learning models, particularly in segmentation tasks, is often limited by the need for extensive annotated datasets. To address this challenge, the capabilities of weakly supervised semantic segmentation are explored through the lens of Explainable AI and the generation of counterfactual explanations. The scope of this research is the development of a novel counterfactual inpainting approach (COIN) that flips the predicted classification label from abnormal to normal by using a generative model. For instance, if the classifier deems an input medical image X abnormal, indicating the presence of a pathology, the generative model aims to inpaint the abnormal region, thus reversing the classifier's original prediction label. The approach enables us to produce precise segmentations for pathologies without depending on pre-existing segmentation masks. Crucially, image-level labels are utilized, which are substantially easier to acquire than detailed segmentation masks. The effectiveness of the method is demonstrated by segmenting synthetic targets and actual kidney tumors from CT images acquired from Tartu University Hospital in Estonia. The findings indicate that COIN greatly surpasses established attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an alternative counterfactual explanation method introduced by Singla et al. This evidence suggests that COIN is a promising approach for semantic segmentation of tumors in CT images, and presents a step forward in making deep learning applications more accessible and effective in healthcare, where annotated data is scarce.
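Reading a segmentation off a counterfactual is conceptually simple; here is a hedged sketch under the assumption that the generative model's inpainting is available as an image (the classifier-flipping loop itself is not shown).

```python
import numpy as np

def counterfactual_mask(image: np.ndarray, inpainted: np.ndarray,
                        threshold: float = 0.1) -> np.ndarray:
    # Once the inpainted (counterfactual) image is classified as normal,
    # the pathology mask is the region that had to change.
    diff = np.abs(image.astype(np.float32) - inpainted.astype(np.float32))
    if diff.ndim == 3:
        diff = diff.mean(axis=-1)                 # collapse channels
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    return diff > threshold                       # binary pseudo mask
```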
https://arxiv.org/abs/2404.12832
Annotating large numbers of 3D medical images for training segmentation models is time-consuming. The goal of weakly supervised semantic segmentation is to train segmentation models without using any ground truth segmentation masks. Our work addresses the case where only image-level categorical labels, indicating the presence or absence of a particular region of interest (such as tumours or lesions), are available. Most existing methods rely on class activation mapping (CAM). We propose a novel approach, ToNNO, which is based on the Tomographic reconstruction of a Neural Network's Output. Our technique extracts stacks of slices at different angles from the input 3D volume, feeds these slices to a 2D encoder, and applies the inverse Radon transform in order to reconstruct a 3D heatmap of the encoder's predictions. This generic method makes it possible to perform dense prediction tasks on 3D volumes using any 2D image encoder. We apply it to weakly supervised medical image segmentation by training the 2D encoder to output high values for slices containing the regions of interest. We test it on four large-scale medical image datasets and outperform 2D CAM methods. We then extend ToNNO by combining tomographic reconstruction with CAM methods, proposing Averaged CAM and Tomographic CAM, which obtain even better results.
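A 2D analogue of the reconstruction step, assuming a recent scikit-image: the encoder's per-slice scores play the role of a sinogram, and the inverse Radon transform turns them into a spatial heatmap. The real method operates on 3D volumes with a 2D encoder.

```python
import numpy as np
from skimage.transform import iradon

def reconstruct_heatmap(slice_scores: np.ndarray,
                        angles: np.ndarray) -> np.ndarray:
    # slice_scores: (num_detector_bins, num_angles) array of encoder
    # outputs, one column per slice-extraction angle (a sinogram).
    return iradon(slice_scores, theta=angles, filter_name="ramp")

# usage (hypothetical): angles = np.linspace(0.0, 180.0, 60, endpoint=False)
```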
https://arxiv.org/abs/2404.13103
We introduce Contrastive Gaussian Clustering, a novel approach capable of providing segmentation masks from any viewpoint and of enabling 3D segmentation of the scene. Recent works in novel-view synthesis have shown how to model the appearance of a scene via a cloud of 3D Gaussians, and how to generate accurate images from a given viewpoint by projecting the Gaussians onto it and $\alpha$-blending their colors. Following this example, we train a model that also includes a segmentation feature vector for each Gaussian. These can then be used for 3D scene segmentation, by clustering Gaussians according to their feature vectors; and to generate 2D segmentation masks, by projecting the Gaussians onto a plane and $\alpha$-blending over their segmentation features. Using a combination of contrastive learning and spatial regularization, our method can be trained on inconsistent 2D segmentation masks, and still learn to generate segmentation masks consistent across all views. Moreover, the resulting model is extremely accurate, improving the IoU accuracy of the predicted masks by $+8\%$ over the state of the art. Code and trained models will be released soon.
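The feature-rendering step mirrors color rendering; a minimal sketch of front-to-back $\alpha$-compositing over per-Gaussian segmentation features, assuming the Gaussians along a pixel ray have already been projected and depth-sorted (the splatting itself is not shown):

```python
import numpy as np

def composite_features(feats: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    # feats: (K, D) segmentation features of the K Gaussians hit by a ray,
    # sorted front-to-back; alphas: (K,) their projected opacities.
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = alphas * transmittance          # standard alpha compositing
    return (weights[:, None] * feats).sum(axis=0)  # (D,) per-pixel feature
```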
https://arxiv.org/abs/2404.12784
Humans show an innate capability to identify tools to support specific actions. The association between object parts and the actions they facilitate is usually named affordance. Being able to segment object parts depending on the tasks they afford is crucial to enable intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP to overcome these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning of models.
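The zero-shot probing side is easy to illustrate; a minimal sketch assuming the openai `clip` package (AffordanceCLIP additionally trains a small number of parameters for dense prediction, which is not shown):

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def affordance_scores(pil_image, action_prompts):
    # action_prompts are free-form strings, e.g. "something to cut with":
    # any action works, not a closed, predefined set.
    with torch.no_grad():
        img = preprocess(pil_image).unsqueeze(0).to(device)
        txt = clip.tokenize(action_prompts).to(device)
        img_f = model.encode_image(img)
        txt_f = model.encode_text(txt)
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
        return (img_f @ txt_f.T).squeeze(0)   # cosine similarity per prompt
```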
https://arxiv.org/abs/2404.12015
Referring image segmentation (RIS) aims to precisely segment referents in images through corresponding natural language expressions, yet relies on cost-intensive mask annotations. Weakly supervised RIS thus learns pixel-level semantics from image-text pairs, which is challenging for segmenting fine-grained masks. A natural approach to enhancing segmentation precision is to empower weakly supervised RIS with the image segmentation foundation model SAM. Nevertheless, we observe that simply integrating SAM yields limited benefits and can even lead to performance regression due to inevitable noise issues and an excessive focus on object parts. In this paper, we present an innovative framework, Point PrompTing (PPT), incorporated with a proposed multi-source curriculum learning strategy to address these challenges. Specifically, the core of PPT is a point generator that not only harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability but also generates negative point prompts to address the noise and excessive-focus issues inherently and effectively. In addition, we introduce a curriculum learning strategy with object-centric images to help PPT gradually learn from simpler yet precise semantic alignment to more complex RIS. Experiments demonstrate that our PPT significantly and consistently outperforms prior weakly supervised techniques on mIoU by 11.34%, 14.14%, and 6.97% on RefCOCO, RefCOCO+, and G-Ref, respectively.
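A hedged sketch of the prompting mechanics, assuming the `segment_anything` package; in PPT the positive and negative points come from a learned, CLIP-driven generator, whereas here they are plain arrays and the checkpoint path is hypothetical.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # hypothetical path
predictor = SamPredictor(sam)

def segment_with_points(image, pos_points, neg_points):
    # image: (H, W, 3) uint8 RGB; points: (k, 2) arrays of (x, y) coords.
    predictor.set_image(image)
    coords = np.concatenate([pos_points, neg_points]).astype(np.float32)
    labels = np.concatenate([np.ones(len(pos_points)),
                             np.zeros(len(neg_points))])  # 1 = fg, 0 = bg
    masks, scores, _ = predictor.predict(point_coords=coords,
                                         point_labels=labels)
    return masks[np.argmax(scores)]           # keep the best-scoring mask
```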
https://arxiv.org/abs/2404.11998
Weakly Incremental Learning for Semantic Segmentation (WILSS) leverages a pre-trained segmentation model to segment new classes using cost-effective and readily available image-level labels. A prevailing way to solve WILSS is the generation of seed areas for each new class, serving as a form of pixel-level supervision. However, a scenario usually arises where a pixel is concurrently predicted as an old class by the pre-trained segmentation model and a new class by the seed areas. Such a scenario becomes particularly problematic in WILSS, as the lack of pixel-level annotations on new classes makes it intractable to ascertain whether the pixel pertains to the new class or not. To surmount this issue, we propose an innovative, tendency-driven relationship of mutual exclusivity, meticulously tailored to govern the behavior of the seed areas and the predictions generated by the pre-trained segmentation model. This relationship stipulates that predictions for the new and old classes must not conflict whilst prioritizing the preservation of predictions for the old classes, which not only addresses the conflicting prediction issue but also effectively mitigates the inherent challenge of incremental learning - catastrophic forgetting. Furthermore, under the auspices of this tendency-driven mutual exclusivity relationship, we generate pseudo masks for the new classes, allowing for concurrent execution with model parameter updating via the resolution of a bi-level optimization problem. Extensive experiments substantiate the effectiveness of our framework, resulting in the establishment of new benchmarks and paving the way for further research in this field.
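The mutual-exclusivity tendency can be illustrated with a deliberately simplified rule, leaving out the paper's bi-level optimization: new-class seeds are accepted only on pixels the pre-trained model does not already assign to an old foreground class, so old predictions are preserved and conflicts never arise.

```python
import numpy as np

def merge_pseudo_mask(old_pred: np.ndarray, new_seed: np.ndarray,
                      background_id: int = 0) -> np.ndarray:
    # old_pred: (H, W) labels from the pre-trained model;
    # new_seed: (H, W) seed-area labels for the new classes.
    merged = old_pred.copy()
    free = old_pred == background_id          # pixels not owned by old classes
    take = free & (new_seed != background_id)
    merged[take] = new_seed[take]
    return merged                             # conflict-free pseudo mask
```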
https://arxiv.org/abs/2404.11981
Content moderation faces a challenging task, as social media's ability to spread hate speech contrasts with its role in promoting global connectivity. With rapidly evolving slang and hate speech, the adaptability of conventional deep learning to the fluid landscape of online dialogue remains limited. In response, causality-inspired disentanglement has shown promise by segregating platform-specific peculiarities from universal hate indicators. However, its dependency on available ground-truth target labels for discerning these nuances faces practical hurdles with the incessant evolution of platforms and the mutable nature of hate speech. Using confidence-based reweighting and contrastive regularization, this study presents HATE WATCH, a novel framework of weakly supervised causal disentanglement that circumvents the need for explicit target labeling and effectively disentangles input features into invariant representations of hate. Empirical validation across four platforms (two with target labels and two without) positions HATE WATCH as a novel method in cross-platform hate speech detection with superior performance. HATE WATCH advances scalable content moderation techniques towards developing safer online communities.
https://arxiv.org/abs/2404.11036
Featurizing microscopy images for use in biological research remains a significant challenge, especially for large-scale experiments spanning millions of images. This work explores the scaling properties of weakly supervised classifiers and self-supervised masked autoencoders (MAEs) when training with increasingly larger model backbones and microscopy datasets. Our results show that ViT-based MAEs outperform weakly supervised classifiers on a variety of tasks, achieving as much as an 11.5% relative improvement when recalling known biological relationships curated from public databases. Additionally, we develop a new channel-agnostic MAE architecture (CA-MAE) that allows inputting images with different numbers and orderings of channels at inference time. We demonstrate that CA-MAEs generalize effectively by inferring and evaluating on a microscopy image dataset (JUMP-CP) generated under different experimental conditions and with a different channel structure than our pretraining data (RPI-93M). Our findings motivate continued research into scaling self-supervised learning on microscopy data in order to create powerful foundation models of cellular biology that have the potential to catalyze advancements in drug discovery and beyond.
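The channel-agnostic idea lends itself to a sketch; the following is one plausible design under stated assumptions (a shared single-channel patch projection plus learned channel embeddings), not necessarily the paper's exact CA-MAE.

```python
import torch
import torch.nn as nn

class ChannelAgnosticPatchEmbed(nn.Module):
    # Tokenize each channel independently with a shared single-channel
    # patch projection plus a channel embedding, so images with any
    # number and order of channels map to a valid token set.
    def __init__(self, patch: int = 16, dim: int = 768, max_channels: int = 32):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.channel_embed = nn.Embedding(max_channels, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        tokens = []
        for c in range(x.shape[1]):
            t = self.proj(x[:, c:c + 1])          # (B, dim, H/p, W/p)
            t = t.flatten(2).transpose(1, 2)      # (B, N, dim)
            tokens.append(t + self.channel_embed.weight[c])
        return torch.cat(tokens, dim=1)           # (B, C*N, dim) token set
```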
https://arxiv.org/abs/2404.10242
Weakly Supervised Object Localization (WSOL) allows for training deep learning models for classification and localization using only global class-level labels. The lack of bounding box (bbox) supervision during training represents a considerable challenge for hyper-parameter search and model selection. Earlier WSOL works implicitly observed localization performance over a test set, which leads to biased performance evaluation. More recently, a better WSOL protocol has been proposed, where a validation set with bbox annotations is held out for model selection. Although it does not rely on the test set, this protocol is unrealistic, since bboxes are not available in real-world applications, and when they are available, it is better to use them directly to fit model weights. Our initial empirical analysis shows that the localization performance of a model declines significantly when using only image-class labels for model selection (compared to using bounding-box annotations). This suggests that bounding-box labels are preferable for selecting the best model for localization. In this paper, we introduce a new WSOL validation protocol that provides a localization signal without the need for manual bbox annotations. In particular, we leverage noisy pseudo boxes from off-the-shelf ROI proposal generators such as Selective Search, CLIP, and pretrained RPN models for model selection. Our experimental results with several WSOL methods on the ILSVRC and CUB-200-2011 datasets show that our noisy boxes allow selecting models with performance close to those selected using ground-truth boxes, and better than models selected using only image-class labels.
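A minimal sketch of the proposed protocol's selection step: rank candidate checkpoints by localization agreement with the noisy pseudo boxes instead of manual annotations. `predict_box(ckpt, image)` is an assumed callable, not part of any real API.

```python
import numpy as np

def box_iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda t: max(0.0, t[2] - t[0]) * max(0.0, t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def select_checkpoint(checkpoints, predict_box, images, pseudo_boxes, thr=0.5):
    # Localization accuracy against pseudo boxes (e.g., from Selective
    # Search) stands in for the usual bbox-annotated validation set.
    def loc_acc(ckpt):
        return np.mean([box_iou(predict_box(ckpt, im), pb) >= thr
                        for im, pb in zip(images, pseudo_boxes)])
    return max(checkpoints, key=loc_acc)
```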
https://arxiv.org/abs/2404.10034
Existing deep trackers are typically trained with large-scale video frames with annotated bounding boxes. However, these bounding boxes are expensive and time-consuming to annotate, in particular for large-scale datasets. In this paper, we propose to learn tracking representations from single-point annotations (i.e., 4.5x faster to annotate than the traditional bounding box) in a weakly supervised manner. Specifically, we propose a soft contrastive learning (SoCL) framework that incorporates a target objectness prior into end-to-end contrastive learning. Our SoCL consists of adaptive positive and negative sample generation, which is memory-efficient and effective for learning tracking representations. We apply the learned representation of SoCL to visual tracking and show that our method can 1) achieve better performance than the fully supervised baseline trained with box annotations under the same annotation time cost; 2) achieve comparable performance to the fully supervised baseline by using the same number of training frames while reducing annotation time cost by 78% and total fees by 85%; and 3) be robust to annotation noise.
https://arxiv.org/abs/2404.09504
Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels based on weak labels and then self-training a classifier is currently a promising solution. However, existing methods use only the RGB visual modality and neglect category text information, limiting the generation of more accurate pseudo-labels and affecting the performance of self-training. Inspired by the manual labeling process based on event descriptions, in this paper we propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model to align video event description text with the corresponding video frames and thereby generate pseudo-labels. Specifically, we first fine-tune CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. Further, we propose a learnable text prompt mechanism, assisted by a normality visual prompt, to further improve the matching accuracy between video event description text and video frames. Then, we design a pseudo-label generation module based on the normality guidance to infer reliable frame-level pseudo-labels. Finally, we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Violence.
https://arxiv.org/abs/2404.08531
Weakly supervised semantic segmentation (WSSS) with image-level labels aims to achieve dense prediction without laborious annotations. However, due to ambiguous contexts and fuzzy regions, the performance of WSSS, especially at the stages of generating Class Activation Maps (CAMs) and refining pseudo masks, widely suffers from ambiguity, an issue barely noticed in the previous literature. In this work, we propose UniA, a unified single-staged WSSS framework, to efficiently tackle this issue from the perspectives of uncertainty inference and affinity diversification, respectively. When activating class objects, we argue that false activation stems from a bias toward ambiguous regions during feature extraction. Therefore, we design a more robust feature representation with a probabilistic Gaussian distribution and introduce uncertainty estimation to avoid the bias. A distribution loss is proposed specifically to supervise the process, which effectively captures the ambiguity and models the complex dependencies among features. When refining pseudo labels, we observe that the affinity produced by prevailing refinement methods tends to be similar among ambiguous regions. To this end, an affinity diversification module is proposed to promote diversity among semantics. A mutually complementary refinement is proposed to initially rectify the ambiguous affinity with multiple inferred pseudo labels. More importantly, a contrastive affinity loss is further designed to diversify the relations among unrelated semantics, which reliably propagates the diversity into the whole feature representation and helps generate better pseudo masks. Extensive experiments are conducted on the PASCAL VOC, MS COCO, and medical ACDC datasets, which validate the efficiency of UniA in tackling ambiguity and its superiority over recent single-staged and even most multi-staged competitors.
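The probabilistic feature idea can be sketched with the standard reparameterization trick; this is a generic illustration of uncertainty-aware features, not UniA's exact module or its distribution loss.

```python
import torch
import torch.nn as nn

class GaussianFeature(nn.Module):
    # Predict a mean and log-variance per feature and sample with the
    # reparameterization trick, so ambiguous regions can express high
    # variance instead of a single biased point estimate.
    def __init__(self, dim: int):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.logvar = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.mu(x), self.logvar(x)
        std = torch.exp(0.5 * logvar)
        sample = mu + torch.randn_like(std) * std
        return sample, mu, logvar   # moments can feed a distribution loss
```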
https://arxiv.org/abs/2404.08195
This paper introduces MAD-MIL, a Multi-head Attention-based Deep Multiple Instance Learning model designed for weakly supervised Whole Slide Image (WSI) classification in digital pathology. Inspired by the multi-head attention mechanism of the Transformer, MAD-MIL reduces model complexity while achieving competitive results against advanced models like CLAM and DS-MIL. Evaluated on MNIST-BAGS and public datasets, including TUPAC16, TCGA BRCA, TCGA LUNG, and TCGA KIDNEY, MAD-MIL consistently outperforms ABMIL. This demonstrates enhanced information diversity, interpretability, and efficiency in slide representation. The model's effectiveness, coupled with fewer trainable parameters and lower computational complexity, makes it a promising solution for automated pathology workflows. Our code is available at this https URL.
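As an illustration of the core mechanism, here is a hedged sketch of ABMIL-style attention pooling split across several heads; the dimensions, head count, and lack of gating are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionMIL(nn.Module):
    def __init__(self, in_dim=512, attn_dim=128, heads=4, n_classes=2):
        super().__init__()
        assert in_dim % heads == 0
        self.head_dim = in_dim // heads
        self.attn = nn.ModuleList(
            nn.Sequential(nn.Linear(self.head_dim, attn_dim), nn.Tanh(),
                          nn.Linear(attn_dim, 1))
            for _ in range(heads))
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, tiles):                    # tiles: (N, in_dim), one bag
        chunks = tiles.split(self.head_dim, dim=1)  # heads x (N, head_dim)
        pooled = []
        for head, x in zip(self.attn, chunks):
            a = torch.softmax(head(x), dim=0)    # (N, 1) attention over tiles
            pooled.append((a * x).sum(dim=0))    # (head_dim,) per-head pooling
        bag = torch.cat(pooled)                  # (in_dim,) slide embedding
        return self.classifier(bag)              # slide-level logits
```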
https://arxiv.org/abs/2404.05362
In hematology, computational models offer significant potential to improve diagnostic accuracy, streamline workflows, and reduce the tedious work of analyzing single cells in peripheral blood or bone marrow smears. However, clinical adoption of computational models has been hampered by the lack of generalization due to large batch effects, small dataset sizes, and poor performance in transfer learning from natural images. To address these challenges, we introduce DinoBloom, the first foundation model for single cell images in hematology, utilizing a tailored DINOv2 pipeline. Our model is built upon an extensive collection of 13 diverse, publicly available datasets of peripheral blood and bone marrow smears, the most substantial open-source cohort in hematology so far, comprising over 380,000 white blood cell images. To assess its generalization capability, we evaluate it on an external dataset with a challenging domain shift. We show that our model outperforms existing medical and non-medical vision models in (i) linear probing and k-nearest neighbor evaluations for cell-type classification on blood and bone marrow smears and (ii) weakly supervised multiple instance learning for acute myeloid leukemia subtyping by a large margin. A family of four DinoBloom models (small, base, large, and giant) can be adapted for a wide range of downstream applications, be a strong baseline for classification problems, and facilitate the assessment of batch effects in new datasets. All models are available at this http URL.
https://arxiv.org/abs/2404.05022
Deep quantization methods have shown high efficiency on large-scale image retrieval. However, current models heavily rely on ground-truth information, hindering the application of quantization in label-hungry scenarios. A more realistic demand is to learn from inexhaustible uploaded images that are associated with informal tags provided by amateur users. Though such sketchy tags do not obviously reveal the labels, they actually contain useful semantic information for supervising deep quantization. To this end, we propose Weakly-Supervised Deep Hyperspherical Quantization (WSDHQ), which is the first work to learn deep quantization from weakly tagged images. Specifically, 1) we use word embeddings to represent the tags and enhance their semantic information based on a tag correlation graph; 2) to better preserve semantic information in the quantization codes and reduce quantization error, we jointly learn semantics-preserving embeddings and a supervised quantizer on the hypersphere by employing a well-designed fusion layer and tailor-made loss functions. Extensive experiments show that WSDHQ can achieve state-of-the-art performance on weakly-supervised compact coding. Code is available at this https URL.
https://arxiv.org/abs/2404.04998
The diagnosis of primary liver cancers (PLCs) can be challenging, especially on biopsies and for combined hepatocellular-cholangiocarcinoma (cHCC-CCA). We automatically classified PLCs on routine-stained biopsies using a weakly supervised learning method. Weak tumour/non-tumour annotations served as labels for training a ResNet18 neural network, and the network's last convolutional layer was used to extract new tumour tile features. Without knowledge of the precise labels of the malignancies, we then applied an unsupervised clustering algorithm. Our model identified specific features of hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (iCCA). Although no specific features of cHCC-CCA were recognized, the identification of HCC and iCCA tiles within a slide could facilitate the diagnosis of primary liver cancers, particularly cHCC-CCA. Method and results: 166 PLC biopsies were divided into training, internal validation, and external validation sets (90, 29, and 47 samples). Two liver pathologists reviewed each whole-slide hematein eosin saffron (HES)-stained image (WSI). After annotation of the tumour/non-tumour areas, 256x256-pixel tiles were extracted from the WSIs and used to train a ResNet18. The network was used to extract new tile features. An unsupervised clustering algorithm was then applied to the new tile features. In a two-cluster model, Clusters 0 and 1 contained mainly HCC and iCCA histological features. The diagnostic agreement between the pathological diagnosis and the model predictions in the internal and external validation sets was 100% (11/11) and 96% (25/26) for HCC, and 78% (7/9) and 87% (13/15) for iCCA, respectively. For cHCC-CCA, we observed a highly variable proportion of tiles from each cluster (Cluster 0: 5-97%; Cluster 1: 2-94%).
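A hedged sketch of the feature-extraction and clustering backbone of this pipeline, assuming torchvision and scikit-learn; the weakly supervised fine-tuning on tumour/non-tumour labels is not shown, and the ImageNet weights here are a placeholder for the trained network.

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.cluster import KMeans

backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()                 # expose last-conv pooled features
backbone.eval()

@torch.no_grad()
def tile_features(tiles: torch.Tensor):     # tiles: (N, 3, 256, 256)
    return backbone(tiles).numpy()          # (N, 512) tile embeddings

def cluster_tiles(feats, k=2):
    # Two clusters suffice here: per-slide cluster composition then
    # suggests HCC-like vs iCCA-like content.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)
```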
https://arxiv.org/abs/2404.04983