Accurate classification of fine-grained images remains a challenge for backbones based on convolutional operations or self-attention mechanisms. This study proposes a novel dual-current neural network (DCNN), which combines the advantages of convolutional operations and self-attention mechanisms to improve the accuracy of fine-grained image classification. The main novel design features for constructing the weakly supervised learning backbone model DCNN include (a) extracting heterogeneous data, (b) keeping the feature-map resolution unchanged, (c) expanding the receptive field, and (d) fusing global representations and local features. Experimental results demonstrated that using DCNN as the backbone network for classifying certain fine-grained benchmark datasets achieved accuracy improvements of 13.5--19.5% over other advanced convolution-based and 2.2--12.9% over attention-based fine-grained backbones.
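Since the abstract only names the design features, here is a minimal PyTorch sketch of what a dual-current block could look like: a dilated convolution branch (local features, unchanged resolution, enlarged receptive field) running in parallel with a self-attention branch (global representations), fused by a 1x1 convolution. All names and hyperparameters are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class DualCurrentBlock(nn.Module):
    """Illustrative block: a convolutional branch (local features) and a
    self-attention branch (global representations) run in parallel at the
    same resolution, then their outputs are fused."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # Local branch: dilated 3x3 conv enlarges the receptive field
        # without reducing the feature-map resolution (stride 1, padding 2).
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=2, dilation=2),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        # Global branch: multi-head self-attention over flattened pixels.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Fusion: concatenate the two currents and mix with a 1x1 conv.
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local(x)                              # (B, C, H, W)
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, HW, C)
        glob, _ = self.attn(tokens, tokens, tokens)        # (B, HW, C)
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local, glob], dim=1))  # fused currents

x = torch.randn(2, 64, 32, 32)
print(DualCurrentBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```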
https://arxiv.org/abs/2405.04093
While Large Language Models (LLMs) have demonstrated proficiency in handling complex queries, much past work has depended on datasets extensively annotated by human experts. However, this reliance on fully-supervised annotations poses scalability challenges, particularly as models and data requirements grow. To mitigate this, we explore the potential of enhancing LLMs' reasoning abilities with minimal human supervision. In this work, we introduce self-reinforcement, which begins with Supervised Fine-Tuning (SFT) of the model on a small collection of annotated questions and then iteratively improves the LLM by learning from the differences between the responses of the SFT and unfinetuned models on unlabeled questions. Our approach is efficient and does not rely heavily on extensive human-annotated explanations. However, current reasoning benchmarks typically include only golden-reference answers or rationales. Therefore, we present \textsc{PuzzleBen}, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales across various domains, such as brainteasers, puzzles, riddles, parajumbles, and critical reasoning tasks. A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing less supervised data to boost LLMs' inference capabilities. Our experiments underscore the significance of \textsc{PuzzleBen}, as well as the effectiveness of our methodology as a promising direction for future endeavors. Our dataset and code will be published soon at \texttt{Anonymity Link}.
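As a rough illustration of the loop described above, the toy sketch below treats the response difference between the SFT model and the unfinetuned model on unlabeled questions as a weak preference signal; `sft_model`, `base_model`, and `preference_update` are hypothetical stand-ins, and the paper's actual update rule may differ.

```python
# Toy stand-ins, loudly hypothetical: each "model" maps a question to a
# (response, confidence) pair; real LLM decoding would replace these.
def sft_model(q):
    return f"answer({q})", 0.9

def base_model(q):
    return f"guess({q})", 0.4

def preference_update(pairs):
    # Placeholder for the actual learning step (e.g., a DPO-style update).
    print(f"updating on {len(pairs)} weak preference pairs")

def self_reinforce(unlabeled_questions, rounds=2):
    for _ in range(rounds):
        pairs = []
        for q in unlabeled_questions:
            chosen, _ = sft_model(q)     # response from the SFT model
            rejected, _ = base_model(q)  # response from the unfinetuned model
            if chosen != rejected:       # the difference is the weak signal
                pairs.append((q, chosen, rejected))
        preference_update(pairs)

self_reinforce(["q1", "q2"])
```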
https://arxiv.org/abs/2405.04086
The accuracy and robustness of 3D human pose estimation (HPE) are limited by 2D pose detection errors and the ill-posed nature of 2D-to-3D lifting, which has drawn great attention to Multi-Hypothesis HPE (MH-HPE) research. Most existing MH-HPE methods are based on generative models, which are computationally expensive and difficult to train. In this study, we propose a Probabilistic Restoration 3D Human Pose Estimation framework (PRPose) that can be integrated with any lightweight single-hypothesis model. Specifically, PRPose employs a weakly supervised approach to fit the hidden probability distribution of the 2D-to-3D lifting process in a single-hypothesis HPE model and then reverse-maps the distribution to the 2D pose input through an adaptive noise sampling strategy to effectively generate reasonable multi-hypothesis samples. Extensive experiments on 3D HPE benchmarks (Human3.6M and MPI-INF-3DHP) highlight the effectiveness and efficiency of PRPose. Code is available at: this https URL.
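A small sketch of the adaptive noise-sampling idea, under the assumption that a per-joint uncertainty estimate `sigma` is available; the toy `lifter` below stands in for any lightweight single-hypothesis 2D-to-3D model.

```python
import torch

def multi_hypothesis(pose_2d, sigma, lifter, k=10):
    """Sketch of the adaptive-noise idea: perturb the 2D input with
    per-joint noise scaled by an estimated uncertainty `sigma`, then run a
    single-hypothesis 2D-to-3D lifter on every sample.
    pose_2d: (J, 2); sigma: (J, 1), assumed estimated per joint."""
    samples = pose_2d + sigma * torch.randn(k, *pose_2d.shape)  # (K, J, 2)
    return torch.stack([lifter(s) for s in samples])            # (K, J, 3)

# Toy lifter: a fixed map standing in for a lightweight HPE model.
lifter = lambda p: torch.cat([p, p.norm(dim=-1, keepdim=True)], dim=-1)
pose = torch.rand(17, 2)                   # 17-joint 2D pose
sigma = 0.05 * torch.ones(17, 1)           # hypothetical uncertainty values
print(multi_hypothesis(pose, sigma, lifter).shape)  # torch.Size([10, 17, 3])
```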
https://arxiv.org/abs/2405.02114
Minimizing the need for pixel-level annotated data to train PET anomaly segmentation networks is crucial, particularly given the time and cost constraints of expert annotation. Current un-/weakly-supervised anomaly detection methods rely on autoencoders or generative adversarial networks trained only on healthy data, although these are more challenging to train. In this work, we present a weakly supervised and Implicitly guided COuNterfactual diffusion model for Detecting Anomalies in PET images, branded as IgCONDA-PET. Training is conditioned on image class labels (healthy vs. unhealthy) along with implicit guidance to generate counterfactuals for an unhealthy image with anomalies. The counterfactual generation process synthesizes the healthy counterpart of a given unhealthy image, and the difference between the two facilitates the identification of anomaly locations. The code is available at: this https URL
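To make the two key steps concrete, the hedged sketch below shows (i) a classifier-free-style guidance update, which is one plausible reading of "implicit guidance", and (ii) the counterfactual difference that localizes anomalies; the denoiser itself and all tensors are placeholders.

```python
import torch

def implicit_guidance(eps_cond, eps_uncond, w=3.0):
    # Classifier-free-style guidance: steer the diffusion noise estimate
    # along the (conditional - unconditional) direction toward "healthy".
    return eps_uncond + w * (eps_cond - eps_uncond)

def anomaly_map(unhealthy, healthy_counterfactual):
    # Anomalies appear where the image and its healthy counterfactual differ.
    return (unhealthy - healthy_counterfactual).abs()

eps_c, eps_u = torch.randn(1, 64, 64), torch.randn(1, 64, 64)
guided = implicit_guidance(eps_c, eps_u)    # used inside each denoising step

x = torch.rand(1, 64, 64)                   # toy "unhealthy" PET slice
cf = x.clone(); cf[0, 20:30, 20:30] = 0.1   # pretend the lesion was inpainted
print(anomaly_map(x, cf)[0, 20:30, 20:30].mean())  # large inside the lesion
```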
https://arxiv.org/abs/2405.00239
Given the emergence of deep learning, digital pathology has gained popularity for cancer diagnosis based on histology images. Deep weakly supervised object localization (WSOL) models can be trained to classify histology images according to cancer grade and to identify regions of interest (ROIs) for interpretation, using inexpensive global image-class annotations. A WSOL model initially trained on labeled source image data can be adapted using unlabeled target data when significant domain shifts arise from variations in staining, scanners, and cancer type. In this paper, we focus on source-free (unsupervised) domain adaptation (SFDA), a challenging problem in which a pre-trained source model is adapted to a new target domain without using any source domain data, for privacy and efficiency reasons. SFDA of WSOL models raises several challenges in histology, most notably because these methods are not designed to adapt for both classification and localization tasks. In this paper, four state-of-the-art SFDA methods, each representative of a main SFDA family, are compared for WSOL in terms of classification and localization accuracy: SFDA-Distribution Estimation, Source HypOthesis Transfer, Cross-Domain Contrastive Learning, and Adaptively Domain Statistics Alignment. Experimental results on the challenging GlaS (smaller, colon cancer) and Camelyon16 (larger, breast cancer) histology datasets indicate that these SFDA methods typically perform poorly for localization after adaptation when optimized for classification.
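For flavor, here is a compact sketch of the objective behind one of the four families, Source HypOthesis Transfer (SHOT-style information maximization): the source classifier head is kept frozen while the feature extractor is tuned on target data alone so that predictions become confident yet diverse. This is a generic rendering, not the exact loss of the compared implementation.

```python
import torch
import torch.nn.functional as F

def shot_information_maximization(logits):
    """Sketch of a SHOT-style objective: minimize per-sample entropy while
    keeping the batch prediction distribution diverse, using target data
    only (no source images are needed)."""
    p = F.softmax(logits, dim=1)                     # (N, K)
    ent = -(p * torch.log(p + 1e-8)).sum(1).mean()   # confident predictions
    mean_p = p.mean(0)
    div = (mean_p * torch.log(mean_p + 1e-8)).sum()  # keeps marginal balanced
    return ent + div   # lower is better: confident AND class-balanced

print(shot_information_maximization(torch.randn(8, 3)))
```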
https://arxiv.org/abs/2404.19113
The MEDIQA-M3G 2024 challenge necessitates novel solutions for Multilingual & Multimodal Medical Answer Generation in dermatology (Yim et al., 2024a). This paper addresses the limitations of traditional methods by proposing a weakly supervised learning approach for open-ended medical question answering (QA). Our system leverages readily available MEDIQA-M3G images via a VGG16-CNN-SVM model, enabling multilingual (English, Chinese, Spanish) learning of informative skin-condition representations. Using pre-trained QA models, we further bridge the gap between visual and textual information through multimodal fusion. This approach tackles complex, open-ended questions even without predefined answer choices. We empower the generation of comprehensive answers by feeding the ViT-CLIP model with multiple responses alongside images. This work advances medical QA research, paving the way for clinical decision support systems and ultimately improving healthcare delivery.
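A minimal sketch of a VGG16-feature + SVM pipeline of the kind the system describes, with random weights, toy images, and toy labels standing in for the actual MEDIQA-M3G setup.

```python
import numpy as np
import torch
from torchvision.models import vgg16
from sklearn.svm import SVC

# Sketch only: extract convolutional descriptors with VGG16, then fit an
# SVM on top. Weights, images, and labels here are placeholders.
backbone = vgg16(weights=None).features.eval()   # conv trunk only
with torch.no_grad():
    imgs = torch.rand(8, 3, 224, 224)            # toy skin-condition images
    feats = backbone(imgs).flatten(1).numpy()    # (8, 25088) descriptors
labels = np.array([0, 1] * 4)                    # toy condition labels
clf = SVC(kernel="linear").fit(feats, labels)
print(clf.predict(feats[:2]))
```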
https://arxiv.org/abs/2405.01583
Slippery road weather conditions are prevalent in many regions and pose a persistent risk to traffic. Still, there has been little research on how autonomous vehicles could detect slippery driving conditions on the road ahead in order to drive safely. In this work, we propose a method to predict a dense grip map for the area in front of the car from postprocessed multimodal sensor data. We trained a convolutional neural network to predict pixelwise grip values from fused RGB camera, thermal camera, and LiDAR reflectance images, using weakly supervised ground truth from an optical road weather sensor. The experiments show that dense grip values can be predicted with good accuracy from the used data modalities, as the produced grip map follows both the ground truth measurements and local weather conditions, such as snowy areas on the road. Models using only the RGB camera or LiDAR reflectance modality provided good baseline grip prediction accuracy, while models fusing the RGB camera, thermal camera, and LiDAR modalities improved the grip predictions significantly.
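The sketch below shows one plausible early-fusion layout for the described setup: RGB, thermal, and LiDAR reflectance images concatenated channel-wise and regressed to a pixelwise grip value in [0, 1]. The architecture and channel counts are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class GripNet(nn.Module):
    """Toy early-fusion grip regressor: RGB (3ch) + thermal (1ch) +
    LiDAR reflectance (1ch) in, dense grip map in [0, 1] out."""
    def __init__(self, in_ch=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),   # grip value per pixel
        )

    def forward(self, rgb, thermal, lidar):
        return self.net(torch.cat([rgb, thermal, lidar], dim=1))

m = GripNet()
grip = m(torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64),
         torch.rand(1, 1, 64, 64))
print(grip.shape)  # torch.Size([1, 1, 64, 64]), a dense grip map
```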
https://arxiv.org/abs/2404.17324
We propose a method to remotely verify the authenticity of Optically Variable Devices (OVDs), often referred to as ``holograms'', in identity documents. Our method processes video clips captured with smartphones under common lighting conditions and is evaluated on two public datasets: MIDV-HOLO and MIDV-2020. Thanks to weakly-supervised training, we optimize a feature extraction and decision pipeline that achieves a new leading performance on MIDV-HOLO, while maintaining a high recall on documents from MIDV-2020 used as attack samples. It is also the first method, to date, to effectively address the photo replacement attack task, and it can be trained on genuine samples, attack samples, or both for increased performance. By enabling verification of OVD shapes and dynamics with very little supervision, this work opens the way towards the use of massive amounts of unlabeled data to build robust remote identity document verification systems on commodity smartphones. Code is available at this https URL
https://arxiv.org/abs/2404.17253
Weakly supervised medical image segmentation (MIS) using generative models is crucial for clinical diagnosis. However, the accuracy of segmentation results is often limited by insufficient supervision and the complex nature of medical imaging. Existing models also provide only a single outcome, which does not allow the measurement of uncertainty. In this paper, we introduce DiffSeg, a segmentation model for skin lesions based on diffusion difference, which exploits diffusion-model principles to extract noise-based features from images with diverse semantic information. By discerning differences between these noise features, the model identifies diseased areas. Moreover, its multi-output capability mimics doctors' annotation behavior, facilitating the visualization of segmentation-result consistency and ambiguity. Additionally, it quantifies output uncertainty using the Generalized Energy Distance (GED), aiding interpretability and decision-making for physicians. Finally, the model integrates outputs through the Dense Conditional Random Field (DenseCRF) algorithm, refining the segmentation boundaries by considering inter-pixel correlations, which improves accuracy and optimizes the segmentation results. We demonstrate the effectiveness of DiffSeg on the ISIC 2018 Challenge dataset, outperforming state-of-the-art U-Net-based methods.
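Since the abstract leans on the Generalized Energy Distance, here is a self-contained sketch of $GED^2 = 2\,\mathbb{E}[d(S,A)] - \mathbb{E}[d(S,S')] - \mathbb{E}[d(A,A')]$ with $d = 1 - \text{IoU}$, computed between sampled model outputs and rater annotations; all masks below are random toys.

```python
import numpy as np

def iou_dist(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 - (inter / union if union else 1.0)

def ged(samples, annotations):
    """Generalized Energy Distance between a set of segmentation samples
    and a set of annotations, with d = 1 - IoU as the distance."""
    cross = np.mean([iou_dist(s, a) for s in samples for a in annotations])
    ss = np.mean([iou_dist(s, t) for s in samples for t in samples])
    aa = np.mean([iou_dist(a, b) for a in annotations for b in annotations])
    return 2 * cross - ss - aa

rng = np.random.default_rng(0)
samples = [rng.random((32, 32)) > 0.5 for _ in range(4)]      # model outputs
annotations = [rng.random((32, 32)) > 0.5 for _ in range(3)]  # rater masks
print(ged(samples, annotations))
```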
https://arxiv.org/abs/2404.16474
Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, computing pairwise similarities between image and text pairs in the contrastive loss poses computational challenges. This paper presents a novel weakly supervised pre-training method for vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in the contrastive loss, achieving a remarkable $2.7\times$ acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality. Our source code, along with pre-trained model weights and training recipes, is available at \url{this https URL}.
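A tiny sketch of the reframing: captions are mapped to multi-label targets over a noun vocabulary and the image head is trained with binary cross-entropy, so no pairwise image-text similarity matrix is ever formed. The four-word vocabulary and the whitespace "noun extraction" are placeholders.

```python
import torch
import torch.nn.functional as F

# Placeholder vocabulary; in practice this would be built from caption nouns.
vocab = {"dog": 0, "ball": 1, "grass": 2, "car": 3}

def caption_to_target(caption):
    # Multi-hot target over the vocabulary (toy noun extraction by split).
    t = torch.zeros(len(vocab))
    for word in caption.lower().split():
        if word in vocab:
            t[vocab[word]] = 1.0
    return t

captions = ["a dog chasing a ball", "a car on grass"]
targets = torch.stack([caption_to_target(c) for c in captions])  # (B, V)
logits = torch.randn(2, len(vocab), requires_grad=True)          # image head
loss = F.binary_cross_entropy_with_logits(logits, targets)       # no pairwise sims
loss.backward()
print(loss.item())
```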
https://arxiv.org/abs/2404.15653
Weakly supervised segmentation methods have gained significant attention due to their ability to reduce reliance on costly pixel-level annotations during model training. However, current weakly supervised nuclei segmentation approaches typically follow a two-stage process of pseudo-label generation and network training. The performance of nuclei segmentation heavily relies on the quality of the generated pseudo-labels, which limits its effectiveness. This paper introduces a novel domain-adaptive weakly supervised nuclei segmentation framework that uses cross-task interaction strategies to overcome the challenge of pseudo-label generation. Specifically, we utilize weakly annotated data to train an auxiliary detection task, which assists the domain adaptation of the segmentation network. To enhance the efficiency of domain adaptation, we design a consistent feature constraint module that integrates prior knowledge from the source domain. Furthermore, we develop pseudo-label optimization and interactive training methods to improve the domain-transfer capability. To validate the effectiveness of the proposed method, we conduct extensive comparative and ablation experiments on six datasets. The results demonstrate the superiority of our approach over existing weakly supervised approaches. Remarkably, our method achieves comparable or even better performance than fully supervised methods. Our code will be released at this https URL.
https://arxiv.org/abs/2404.14956
Current point cloud semantic segmentation has achieved great advances when sufficient labels are given. However, dense annotation of LiDAR point clouds remains prohibitively expensive and time-consuming, unable to keep up with the continuously growing volume of data. In this paper, we propose annotating images with scattered points and then utilizing SAM (a foundation model) to generate semantic segmentation labels for the images. Finally, by mapping the segmentation labels of the images into the LiDAR space using the intrinsic and extrinsic parameters of the camera and LiDAR, we obtain labels for point cloud semantic segmentation, and we release Scatter-KITTI and Scatter-nuScenes, the first works to utilize image-segmentation-based SAM for weakly supervised point cloud semantic segmentation. Furthermore, to mitigate the influence of erroneous pseudo labels obtained from sparse annotations on point cloud features, we propose a multi-modal weakly supervised network for LiDAR semantic segmentation, called MM-ScatterNet. This network combines features from both the point cloud and image modalities, enhancing the representation learning of point clouds by introducing consistency constraints between multi-modal features and point cloud features. On the SemanticKITTI dataset we achieve 66% of fully supervised performance using only 0.02% of the annotated data, and on the nuScenes dataset we achieve 95% of fully supervised performance using only 0.1% of the labeled points.
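The label-mapping step is standard projective geometry; the hedged sketch below projects LiDAR points into the image using intrinsics K and extrinsics T and copies each pixel's segmentation label onto the corresponding point (calibration and labels are toys).

```python
import numpy as np

def project_labels_to_points(points, seg_labels, K, T_cam_from_lidar):
    """Map per-pixel segmentation labels onto LiDAR points using camera
    intrinsics K (3x3) and extrinsics T (4x4). Returns a per-point label,
    -1 for points outside the image or behind the camera."""
    n = points.shape[0]
    pts_h = np.hstack([points, np.ones((n, 1))])        # homogeneous (N, 4)
    cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]         # camera frame
    z = cam[:, 2]
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)    # pixel coordinates
    h, w = seg_labels.shape
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.full(n, -1, dtype=seg_labels.dtype)
    out[valid] = seg_labels[v[valid], u[valid]]
    return out

K = np.array([[500., 0, 320], [0, 500., 240], [0, 0, 1]])  # toy intrinsics
T = np.eye(4)                                              # toy extrinsics
pts = np.random.rand(1000, 3) * [4, 3, 10] - [2, 1.5, 0]   # points ahead
labels = project_labels_to_points(pts, np.random.randint(0, 5, (480, 640)), K, T)
print((labels >= 0).sum(), "points received image labels")
```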
https://arxiv.org/abs/2404.12861
Deep learning is dramatically transforming the field of medical imaging and radiology, enabling the identification of pathologies in medical images, including computed tomography (CT) and X-ray scans. However, the performance of deep learning models, particularly in segmentation tasks, is often limited by the need for extensive annotated datasets. To address this challenge, the capabilities of weakly supervised semantic segmentation are explored through the lens of Explainable AI and the generation of counterfactual explanations. The scope of this research is the development of a novel counterfactual inpainting approach (COIN) that flips the predicted classification label from abnormal to normal using a generative model. For instance, if the classifier deems an input medical image X abnormal, indicating the presence of a pathology, the generative model aims to inpaint the abnormal region, thus reversing the classifier's original prediction. This approach enables us to produce precise segmentations of pathologies without depending on pre-existing segmentation masks. Crucially, only image-level labels are utilized, which are substantially easier to acquire than detailed segmentation masks. The effectiveness of the method is demonstrated by segmenting synthetic targets and actual kidney tumors in CT images acquired from Tartu University Hospital in Estonia. The findings indicate that COIN greatly surpasses established attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an alternative counterfactual explanation method introduced by Singla et al. This evidence suggests that COIN is a promising approach for semantic segmentation of tumors in CT images and represents a step forward in making deep learning applications more accessible and effective in healthcare, where annotated data is scarce.
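The localization step reduces to a pixelwise difference between the input and its inpainted counterfactual; a minimal sketch follows, with a toy image, a toy counterfactual, and an illustrative threshold value.

```python
import numpy as np

def counterfactual_segmentation(image, inpainted, thresh=0.1):
    """Once a generative model has inpainted the abnormal region so the
    classifier flips to "normal", the pixelwise difference between input
    and counterfactual localizes the pathology. The threshold value here
    is an illustrative assumption."""
    diff = np.abs(image - inpainted)
    return (diff > thresh).astype(np.uint8)   # binary pathology mask

img = np.random.rand(64, 64) * 0.05
img[20:30, 40:50] += 0.8                      # toy "tumour"
cf = np.random.rand(64, 64) * 0.05            # toy "healthy" counterfactual
mask = counterfactual_segmentation(img, cf)
print(mask.sum(), "pixels flagged")           # roughly the 10x10 tumour
```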
https://arxiv.org/abs/2404.12832
Annotating large numbers of 3D medical images to train segmentation models is time-consuming. The goal of weakly supervised semantic segmentation is to train segmentation models without using any ground truth segmentation masks. Our work addresses the case where only image-level categorical labels, indicating the presence or absence of a particular region of interest (such as tumours or lesions), are available. Most existing methods rely on class activation mapping (CAM). We propose a novel approach, ToNNO, which is based on the Tomographic reconstruction of a Neural Network's Output. Our technique extracts stacks of slices at different angles from the input 3D volume, feeds these slices to a 2D encoder, and applies the inverse Radon transform to reconstruct a 3D heatmap of the encoder's predictions. This generic method allows dense prediction tasks to be performed on 3D volumes using any 2D image encoder. We apply it to weakly supervised medical image segmentation by training the 2D encoder to output high values for slices containing the regions of interest. We test it on four large-scale medical image datasets and outperform 2D CAM methods. We then extend ToNNO by combining tomographic reconstruction with CAM methods, proposing Averaged CAM and Tomographic CAM, which obtain even better results.
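A 2D analogue of the reconstruction step, using scikit-image's Radon transform pair: per-slice predictions collected over many angles form a sinogram, and the inverse Radon transform turns them into a dense heatmap. Here `radon` itself stands in for the learned 2D encoder's per-slice scores.

```python
import numpy as np
from skimage.transform import radon, iradon

# Toy "volume" (an image) with one region of interest.
image = np.zeros((128, 128))
image[40:60, 70:90] = 1.0
theta = np.linspace(0., 180., 90, endpoint=False)

# radon() integrates along each slice direction; in ToNNO a trained 2D
# encoder would replace this projection step with its own slice scores.
sinogram = radon(image, theta=theta)
heatmap = iradon(sinogram, theta=theta, filter_name="ramp")
print(np.unravel_index(heatmap.argmax(), heatmap.shape))  # lands in the ROI
```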
https://arxiv.org/abs/2404.13103
We introduce Contrastive Gaussian Clustering, a novel approach capable of providing segmentation masks from any viewpoint and of enabling 3D segmentation of the scene. Recent works in novel-view synthesis have shown how to model the appearance of a scene via a cloud of 3D Gaussians, and how to generate accurate images from a given viewpoint by projecting the Gaussians onto it and $\alpha$-blending their colors. Following this example, we train a model that also includes a segmentation feature vector for each Gaussian. These can then be used for 3D scene segmentation, by clustering Gaussians according to their feature vectors, and to generate 2D segmentation masks, by projecting the Gaussians onto a plane and $\alpha$-blending their segmentation features. Using a combination of contrastive learning and spatial regularization, our method can be trained on inconsistent 2D segmentation masks and still learn to generate segmentation masks that are consistent across all views. Moreover, the resulting model is extremely accurate, improving the IoU of the predicted masks by $+8\%$ over the state of the art. Code and trained models will be released soon.
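The feature-rendering step mirrors color compositing; below is a hedged sketch of front-to-back $\alpha$-blending of per-Gaussian segmentation features along a ray (shapes and values are toys).

```python
import torch

def alpha_blend_features(features, alphas):
    """Front-to-back alpha compositing of per-Gaussian feature vectors
    along a ray, mirroring how colors are blended in Gaussian splatting.
    features: (N, D) sorted near-to-far; alphas: (N,)."""
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1 - alphas[:-1]]), dim=0)  # (N,)
    weights = alphas * transmittance                         # (N,)
    return (weights[:, None] * features).sum(0)              # (D,)

feats = torch.randn(5, 16)      # 5 Gaussians on a ray, 16-dim seg features
alphas = torch.tensor([0.3, 0.5, 0.2, 0.9, 0.1])
print(alpha_blend_features(feats, alphas).shape)  # torch.Size([16])
```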
https://arxiv.org/abs/2404.12784
Humans show an innate capability to identify tools that support specific actions. The association between object parts and the actions they facilitate is usually called affordance. Being able to segment object parts according to the tasks they afford is crucial for enabling intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support only a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP to overcome these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning in models.
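One generic zero-shot recipe of the kind such methods build on: cosine similarity between CLIP-style patch embeddings and an action-prompt embedding, reshaped into a spatial heatmap. Random features stand in for real CLIP outputs, and the paper's learned components are omitted.

```python
import torch
import torch.nn.functional as F

def affordance_heatmap(patch_feats, text_feat):
    """Cosine similarity between patch embeddings (N, D) and one action
    prompt embedding (D,), reshaped to a square spatial grid."""
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(text_feat, dim=-1)
    n = int(sim.numel() ** 0.5)
    return sim.reshape(n, n)                 # (grid, grid) affordance map

patches = torch.randn(14 * 14, 512)          # toy ViT patch embeddings
prompt = torch.randn(512)                    # toy embedding of "cut with"
print(affordance_heatmap(patches, prompt).shape)  # torch.Size([14, 14])
```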
https://arxiv.org/abs/2404.12015
Referring image segmentation (RIS) aims to precisely segment referents in images through corresponding natural language expressions, yet it relies on cost-intensive mask annotations. Weakly supervised RIS thus learns pixel-level semantics from image-text pairs, which is challenging for segmenting fine-grained masks. A natural approach to enhancing segmentation precision is to empower weakly supervised RIS with the image segmentation foundation model SAM. Nevertheless, we observe that simply integrating SAM yields limited benefits and can even lead to performance regression due to inevitable noise issues and an excessive focus on object parts. In this paper, we present an innovative framework, Point PrompTing (PPT), incorporating a multi-source curriculum learning strategy to address these challenges. Specifically, the core of PPT is a point generator that not only harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability, but also generates negative point prompts to address the noise and excessive-focus issues inherently and effectively. In addition, we introduce a curriculum learning strategy with object-centric images to help PPT gradually progress from simpler yet precise semantic alignment to more complex RIS. Experiments demonstrate that our PPT significantly and consistently outperforms prior weakly supervised techniques, improving mIoU by 11.34%, 14.14%, and 6.97% on RefCOCO, RefCOCO+, and G-Ref, respectively.
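One way to picture the point generator's output: peaks of a CLIP-style similarity map become positive point prompts and the lowest-scoring locations become negative prompts, which could then be passed to SAM's `predict(point_coords=..., point_labels=...)`. The peak-picking below is a deliberately simplified stand-in for the learned generator.

```python
import numpy as np

def point_prompts(sim_map, n_pos=1, n_neg=1):
    """Pick the highest-scoring locations of a text-image similarity map
    as positive prompts and the lowest as negative prompts (SAM uses
    label 1 for positive and 0 for negative points)."""
    flat = sim_map.ravel()
    pos_idx = np.argsort(flat)[-n_pos:]
    neg_idx = np.argsort(flat)[:n_neg]
    to_xy = lambda idx: np.stack(np.unravel_index(idx, sim_map.shape), 1)[:, ::-1]
    points = np.concatenate([to_xy(pos_idx), to_xy(neg_idx)])  # (x, y) coords
    labels = np.array([1] * n_pos + [0] * n_neg)
    return points, labels

sim = np.random.rand(32, 32)     # toy CLIP-style similarity map
pts, lbl = point_prompts(sim, n_pos=2, n_neg=2)
print(pts, lbl)
```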
https://arxiv.org/abs/2404.11998
Weakly Incremental Learning for Semantic Segmentation (WILSS) leverages a pre-trained segmentation model to segment new classes using cost-effective and readily available image-level labels. A prevailing way to solve WILSS is the generation of seed areas for each new class, serving as a form of pixel-level supervision. However, a scenario often arises in which a pixel is concurrently predicted as an old class by the pre-trained segmentation model and as a new class by the seed areas. This scenario is particularly problematic in WILSS, as the lack of pixel-level annotations for new classes makes it intractable to ascertain whether the pixel pertains to the new class. To surmount this issue, we propose an innovative, tendency-driven relationship of mutual exclusivity, meticulously tailored to govern the behavior of the seed areas and the predictions generated by the pre-trained segmentation model. This relationship stipulates that predictions for the new and old classes must not conflict, whilst prioritizing the preservation of predictions for the old classes, which not only addresses the conflicting-prediction issue but also effectively mitigates the inherent challenge of incremental learning: catastrophic forgetting. Furthermore, under the auspices of this tendency-driven mutual exclusivity relationship, we generate pseudo masks for the new classes, allowing for concurrent execution with model parameter updating via the resolution of a bi-level optimization problem. Extensive experiments substantiate the effectiveness of our framework, resulting in the establishment of new benchmarks and paving the way for further research in this field.
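To make the exclusivity tendency tangible, here is a toy mask-merging rule: old-class predictions are preserved wherever the old model fires, and new-class seeds are accepted only on pixels the old model calls background, so the two supervision sources never conflict. The actual framework couples this with bi-level optimization; this sketch carries only the intuition.

```python
import numpy as np

def merge_pseudo_labels(old_pred, seeds):
    """Preserve the old model's predictions wherever it fires (mitigating
    forgetting) and accept new-class seeds only on pixels the old model
    labels as background (class 0), so predictions never conflict."""
    out = old_pred.copy()
    bg = old_pred == 0
    out[bg & (seeds > 0)] = seeds[bg & (seeds > 0)]
    return out

old = np.array([[1, 1, 0], [0, 0, 0]])   # old-class prediction
seed = np.array([[0, 2, 2], [2, 0, 0]])  # seed areas for new class 2
print(merge_pseudo_labels(old, seed))    # [[1 1 2] [2 0 0]]: old class wins
```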
https://arxiv.org/abs/2404.11981
Content moderation faces a challenging task, as social media's ability to spread hate speech contrasts with its role in promoting global connectivity. With rapidly evolving slang and hate speech, the adaptability of conventional deep learning to the fluid landscape of online dialogue remains limited. In response, causality-inspired disentanglement has shown promise by segregating platform-specific peculiarities from universal hate indicators. However, its dependency on available ground-truth target labels for discerning these nuances faces practical hurdles given the incessant evolution of platforms and the mutable nature of hate speech. Using confidence-based reweighting and contrastive regularization, this study presents HATE WATCH, a novel framework of weakly supervised causal disentanglement that circumvents the need for explicit target labeling and effectively disentangles input features into invariant representations of hate. Empirical validation across four platforms (two with target labels and two without) positions HATE WATCH as a novel method for cross-platform hate speech detection with superior performance. HATE WATCH advances scalable content moderation techniques towards developing safer online communities.
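A generic rendering of confidence-based reweighting under weak labels, not HATE WATCH's exact objective: each pseudo-labeled example contributes to the loss in proportion to the model's own confidence in its label, so noisy weak labels are down-weighted.

```python
import torch
import torch.nn.functional as F

def confidence_reweighted_loss(logits, pseudo_labels):
    """Weight each example's cross-entropy by the (detached) predicted
    probability of its pseudo label, down-weighting low-confidence,
    likely-noisy weak labels."""
    probs = F.softmax(logits, dim=1)
    conf = probs.gather(1, pseudo_labels[:, None]).squeeze(1).detach()
    per_example = F.cross_entropy(logits, pseudo_labels, reduction="none")
    return (conf * per_example).mean()

logits = torch.randn(16, 2, requires_grad=True)  # hate vs. non-hate
pseudo = torch.randint(0, 2, (16,))              # weak pseudo labels
loss = confidence_reweighted_loss(logits, pseudo)
loss.backward()
print(loss.item())
```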
https://arxiv.org/abs/2404.11036
Featurizing microscopy images for use in biological research remains a significant challenge, especially for large-scale experiments spanning millions of images. This work explores the scaling properties of weakly supervised classifiers and self-supervised masked autoencoders (MAEs) when trained with increasingly larger model backbones and microscopy datasets. Our results show that ViT-based MAEs outperform weakly supervised classifiers on a variety of tasks, achieving as much as an 11.5% relative improvement when recalling known biological relationships curated from public databases. Additionally, we develop a new channel-agnostic MAE architecture (CA-MAE) that allows images with different numbers and orderings of channels to be input at inference time. We demonstrate that CA-MAEs generalize effectively by inferring and evaluating on a microscopy image dataset (JUMP-CP) generated under different experimental conditions and with a different channel structure than our pretraining data (RPI-93M). Our findings motivate continued research into scaling self-supervised learning on microscopy data in order to create powerful foundation models of cellular biology with the potential to catalyze advancements in drug discovery and beyond.
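A hedged sketch of one way to get channel agnosticism: patchify each channel independently with a single shared one-channel projection, so any number or ordering of channels yields a valid token sequence at inference. A real CA-MAE may differ in detail, e.g., by adding channel embeddings.

```python
import torch
import torch.nn as nn

class ChannelAgnosticPatchEmbed(nn.Module):
    """Illustrative channel-agnostic tokenizer: every channel is patchified
    independently with one shared single-channel projection, so inputs with
    any channel count or order produce a valid token sequence."""
    def __init__(self, patch=16, dim=128):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                   # x: (B, C, H, W), any C
        b, c, h, w = x.shape
        per_ch = self.proj(x.reshape(b * c, 1, h, w))      # (B*C, D, h', w')
        tokens = per_ch.flatten(2).transpose(1, 2)         # (B*C, N, D)
        return tokens.reshape(b, c * tokens.shape[1], -1)  # (B, C*N, D)

emb = ChannelAgnosticPatchEmbed()
print(emb(torch.rand(2, 5, 64, 64)).shape)   # 5-channel microscopy input
print(emb(torch.rand(2, 8, 64, 64)).shape)   # 8 channels also works
```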
https://arxiv.org/abs/2404.10242