In weakly supervised medical image segmentation, the absence of structural priors and the discreteness of class feature distributions pose a central challenge: how can supervision signals be accurately propagated from local to global regions without spreading excessively into irrelevant ones? To address this, we propose PCLMix, a novel weakly supervised medical image segmentation framework comprising dynamic mix augmentation, pixel-level contrastive learning, and consistency regularization. Specifically, PCLMix is built upon a heterogeneous dual-decoder backbone and addresses the absence of structural priors through dynamic mix augmentation during training. To handle the discrete distribution of class features, PCLMix incorporates pixel-level contrastive learning based on prediction uncertainty, effectively strengthening the model's ability to separate inter-class pixel differences while preserving intra-class consistency. Furthermore, to reinforce segmentation consistency and robustness, PCLMix employs an auxiliary decoder for dual consistency regularization. At inference, the auxiliary decoder is dropped, so no computational overhead is added. Extensive experiments on the ACDC dataset demonstrate that PCLMix appropriately propagates local supervision signals to the global scale, further narrowing the gap between weakly supervised and fully supervised segmentation methods. Our code is available at this https URL.
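As a rough illustration of two of these ingredients, the sketch below pairs a CutMix-style dynamic mix augmentation with a symmetric-KL consistency term between the two decoder heads; the function names, mixing policy, and loss form are our assumptions, not PCLMix's exact implementation.

```python
import torch
import torch.nn.functional as F

def dynamic_mix(x_a, y_a, x_b, y_b, alpha=1.0):
    """CutMix-style mixing of two images and their label maps.

    A minimal sketch of dynamic mix augmentation; the paper's actual
    mixing policy may differ.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    _, _, h, w = x_a.shape
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    x_mix, y_mix = x_a.clone(), y_a.clone()
    x_mix[:, :, y1:y2, x1:x2] = x_b[:, :, y1:y2, x1:x2]
    y_mix[:, y1:y2, x1:x2] = y_b[:, y1:y2, x1:x2]
    return x_mix, y_mix

def dual_consistency(logits_main, logits_aux):
    """Symmetric KL between the two decoders' class distributions."""
    kl = F.kl_div(logits_main.log_softmax(1), logits_aux.softmax(1),
                  reduction="batchmean")
    kl_rev = F.kl_div(logits_aux.log_softmax(1), logits_main.softmax(1),
                      reduction="batchmean")
    return 0.5 * (kl + kl_rev)
```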
https://arxiv.org/abs/2405.06288
Anatomical shape analysis plays a pivotal role in clinical research and hypothesis testing, where the relationship between form and function is paramount. Correspondence-based statistical shape modeling (SSM) facilitates population-level morphometrics but requires a cumbersome, potentially bias-inducing construction pipeline. Recent advances in deep learning have streamlined this process at inference time by predicting SSM directly from unsegmented medical images. However, these approaches are fully supervised and require a traditional SSM construction pipeline to create training data, thus inheriting the associated burdens and limitations. To address these challenges, we introduce a weakly supervised deep learning approach that predicts SSM from images using point cloud supervision. Specifically, we propose reducing the supervision required by the state-of-the-art fully Bayesian variational information bottleneck DeepSSM (BVIB-DeepSSM) model. BVIB-DeepSSM is an effective, principled framework for predicting probabilistic anatomical shapes from images while quantifying both aleatoric and epistemic uncertainty. Whereas the original BVIB-DeepSSM method requires strong supervision in the form of ground truth correspondence points, the proposed approach uses weak supervision via point cloud surface representations, which are more readily obtainable. Furthermore, it learns correspondence in a completely data-driven manner, without prior assumptions about the expected variability in the shape cohort. Our experiments demonstrate that this approach yields accuracy and uncertainty estimates similar to the fully supervised scenario while substantially improving the feasibility of model training for SSM construction.
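A natural training signal under this kind of point cloud supervision is a bidirectional Chamfer distance between the predicted correspondence points and the surface point cloud. The sketch below shows that term alone; the paper's full objective is a Bayesian variational information bottleneck bound, and the function name is ours.

```python
import torch

def chamfer_distance(pred_pts, cloud_pts):
    """Bidirectional Chamfer distance between predicted correspondence
    points (B, N, 3) and a surface point cloud (B, M, 3).

    A hedged sketch of a point-cloud supervision term, not BVIB-DeepSSM's
    exact loss.
    """
    d = torch.cdist(pred_pts, cloud_pts)  # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()
```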
https://arxiv.org/abs/2405.09697
Current speaker diarization systems rely on an external voice activity detection (VAD) model to find speech segments before speaker embeddings are extracted. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs on par with or better than comparable supervised VAD systems. Consequently, speaker diarization can be performed efficiently by extracting the VAD logits and the corresponding speaker embedding simultaneously, removing the need for, and computational overhead of, an external VAD model. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline that uses ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy achieves state-of-the-art performance on the AMI, VoxConverse, and DIHARD III diarization benchmarks.
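To make the mechanism concrete, here is a minimal single-head attentive statistics pooling layer whose frame-level attention scores can be read out as VAD evidence; ECAPA2's actual pooling is channel-dependent, so treat this as a hedged sketch rather than its implementation.

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Attentive statistics pooling whose frame-level attention scores can
    be read out as weakly supervised VAD logits, per the observation above.

    A minimal single-head sketch; ECAPA2's pooling is channel-dependent.
    """
    def __init__(self, feat_dim, bottleneck=128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv1d(feat_dim, bottleneck, 1), nn.Tanh(),
            nn.Conv1d(bottleneck, 1, 1),
        )

    def forward(self, x):                  # x: (B, C, T) frame features
        logits = self.score(x)             # (B, 1, T): doubles as VAD logits
        w = logits.softmax(dim=2)
        mu = (w * x).sum(dim=2)            # attention-weighted mean
        var = (w * x ** 2).sum(dim=2) - mu ** 2
        sigma = var.clamp(min=1e-6).sqrt() # attention-weighted std
        return torch.cat([mu, sigma], dim=1), logits.squeeze(1)
```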
https://arxiv.org/abs/2405.09142
In this paper, we address the segmentation of tumor subtypes in whole slide images (WSIs) using incomplete label proportions. Specifically, we use ``partial'' label proportions, which give the proportions among tumor subtypes but not the proportion between tumor and non-tumor. Partial label proportions are recorded by pathologists as standard diagnostic information, and we therefore want to use them to build a segmentation model that classifies each WSI patch as one of the tumor subtypes or as non-tumor. We call this problem ``learning from partial label proportions (LPLP)'' and formulate it as a weakly supervised learning problem. We then propose an efficient algorithm for this challenging problem by decomposing it into two weakly supervised subproblems: multiple instance learning (MIL) and learning from label proportions (LLP). These subproblems are optimized efficiently in an end-to-end manner. The effectiveness of our algorithm is demonstrated through experiments on two WSI datasets.
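The LLP half of the decomposition can be sketched as matching the slide-level mixture of predicted subtype probabilities, weighted by MIL-style tumor probabilities, against the recorded partial proportions; the names and exact weighting below are illustrative assumptions, not the paper's formulation.

```python
import torch

def llp_proportion_loss(patch_probs, subtype_prior, tumor_probs):
    """Match predicted tumor-subtype proportions within a WSI (bag) to the
    recorded partial label proportions.

    patch_probs:   (P, K) softmax over K tumor subtypes per patch
    subtype_prior: (K,) diagnosed proportions among tumor subtypes
    tumor_probs:   (P,) MIL-style probability that each patch is tumor

    A sketch of the LLP subproblem only; the MIL subproblem would supply
    tumor_probs from slide-level supervision.
    """
    w = tumor_probs / tumor_probs.sum().clamp(min=1e-8)
    est = (w.unsqueeze(1) * patch_probs).sum(dim=0)  # expected subtype mix
    return -(subtype_prior * est.clamp(min=1e-8).log()).sum()  # cross-entropy
```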
https://arxiv.org/abs/2405.09041
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. However, recent studies have shown that the representations learned by CLIP are not well suited for dense prediction tasks such as object detection, semantic segmentation, and depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate this weak downstream performance. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant gains on downstream dense prediction tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised, and weakly supervised pretraining methods. We show that when a CLIP model with a ViT-B/16 image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation, respectively, than recent state-of-the-art Masked Image Modeling (MIM) pretraining methods such as the Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining matches the semantic segmentation performance of Swin-L pretrained on ImageNet-22k while being 6.1$\times$ smaller. Moreover, we show that improving caption quality yields $10\times$ data efficiency when finetuning for dense prediction tasks.
https://arxiv.org/abs/2405.08911
Weakly supervised multimodal violence detection aims to learn a violence detection model by leveraging multiple modalities such as RGB, optical flow, and audio, while only video-level annotations are available. In the pursuit of effective multimodal violence detection (MVD), we identify information redundancy, modality imbalance, and modality asynchrony as three key challenges. In this work, we propose a new weakly supervised MVD method that explicitly addresses them. Specifically, we introduce a fusion module based on a multi-scale bottleneck transformer (MSBT), which employs a reduced number of bottleneck tokens to gradually condense information while fusing each pair of modalities, together with a bottleneck-token-based weighting scheme that highlights the more important fused features. Furthermore, we propose a temporal consistency contrast loss to semantically align pairwise fused features. Experiments on the largest-scale XD-Violence dataset demonstrate that the proposed method achieves state-of-the-art performance. Code is available at this https URL.
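A single-scale sketch of bottleneck-token fusion is given below: a few learned tokens first attend over both modalities, then each modality attends back to the condensed tokens. MSBT stacks several such scales with progressively fewer tokens and a token-based weighting, which this sketch omits; class and parameter names are ours.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Fuse two modality token streams through a small set of bottleneck
    tokens, so cross-modal information must pass through the bottleneck.

    A single-scale sketch of the MSBT idea; attention weights are shared
    across modalities here purely for brevity.
    """
    def __init__(self, dim, n_bottleneck=4, n_heads=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim))
        self.collect = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.distribute = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tok_a, tok_b):           # (B, Ta, D), (B, Tb, D)
        btl = self.bottleneck.expand(tok_a.size(0), -1, -1)
        ctx = torch.cat([tok_a, tok_b], dim=1)
        btl, _ = self.collect(btl, ctx, ctx)   # bottleneck reads both streams
        fused_a, _ = self.distribute(tok_a, btl, btl)  # streams read bottleneck
        fused_b, _ = self.distribute(tok_b, btl, btl)
        return fused_a, fused_b
```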
https://arxiv.org/abs/2405.05130
Weakly supervised semantic segmentation (WSSS) aims to learn a semantic segmentation model from image-level tags alone. Despite a decade of intensive research on deep learning approaches, there is still a significant performance gap between WSSS and fully supervised semantic segmentation. Most current WSSS methods focus on limited single-image (pixel-wise) information while ignoring valuable inter-image (semantic-wise) information. From this perspective, we develop a novel end-to-end WSSS framework called DSCNet with two innovations: i) pixel-wise group contrast and semantic-wise graph contrast are proposed and introduced into the WSSS framework; ii) a novel dual-stream contrastive learning (DSCL) mechanism is designed to jointly handle pixel-wise and semantic-wise context information for better WSSS performance. Specifically, the pixel-wise group contrast learning (PGCL) and semantic-wise graph contrast learning (SGCL) tasks together form a more comprehensive solution. Extensive experiments on the PASCAL VOC and MS COCO benchmarks verify the superiority of DSCNet over SOTA approaches and baseline models.
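The pixel-wise side of such a scheme can be sketched as a supervised InfoNCE loss over sampled pixel embeddings, as below; DSCNet's actual PGCL grouping and the semantic-wise graph branch (SGCL) are not reproduced here, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def pixel_group_contrast(feats, labels, tau=0.1):
    """Supervised InfoNCE over pixel embeddings: pixels sharing a pseudo
    label attract, all others repel.

    A generic sketch of pixel-wise contrast, assuming each sampled class
    occurs at least twice. feats: (N, D) embeddings, labels: (N,) classes.
    """
    feats = F.normalize(feats, dim=1)
    logits = feats @ feats.t() / tau                        # (N, N)
    self_mask = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = logits - torch.logsumexp(
        logits.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    return -log_prob[pos].mean()
```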
https://arxiv.org/abs/2405.04913
Accurate classification of fine-grained images remains a challenge for backbones based on convolutional operations or self-attention mechanisms. This study proposes a novel dual-current neural network (DCNN) that combines the advantages of convolutional operations and self-attention mechanisms to improve the accuracy of fine-grained image classification. The main novel design features for constructing DCNN as a weakly supervised learning backbone are (a) extracting heterogeneous data, (b) keeping the feature map resolution unchanged, (c) expanding the receptive field, and (d) fusing global representations with local features. Experimental results show that using DCNN as the backbone for classifying certain fine-grained benchmark datasets achieved performance improvements of 13.5--19.5% and 2.2--12.9%, respectively, over other advanced convolution- and attention-based fine-grained backbones.
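One plausible reading of design features (b)-(d) is a block that runs a dilated convolutional branch (local, resolution-preserving, enlarged receptive field) in parallel with a self-attention branch (global) and fuses them, as sketched below; this is our illustrative interpretation, not the paper's exact block.

```python
import torch
import torch.nn as nn

class DualCurrentBlock(nn.Module):
    """Parallel local (dilated conv) and global (self-attention) branches
    fused at equal resolution.

    An illustrative sketch of features (b)-(d) above, not DCNN's design.
    """
    def __init__(self, dim, heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=2, dilation=2)  # (b),(c)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)                      # (d)

    def forward(self, x):                       # (B, C, H, W)
        b, c, h, w = x.shape
        loc = self.local(x)                     # resolution preserved
        seq = x.flatten(2).transpose(1, 2)      # (B, HW, C) token view
        glb, _ = self.attn(seq, seq, seq)       # global representation
        glb = glb.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([loc, glb], dim=1))
```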
https://arxiv.org/abs/2405.04093
While Large Language Models (LLMs) have demonstrated proficiency in handling complex queries, much past work has depended on datasets extensively annotated by human experts. This reliance on fully supervised annotation poses scalability challenges, particularly as models and data requirements grow. To mitigate this, we explore the potential of enhancing LLMs' reasoning abilities with minimal human supervision. In this work, we introduce self-reinforcement, which begins with Supervised Fine-Tuning (SFT) of the model on a small collection of annotated questions and then iteratively improves the LLM by learning from the differences in responses between the SFT and unfinetuned models on unlabeled questions. Our approach is efficient and does not rely heavily on extensive human-annotated explanations. However, current reasoning benchmarks typically include only golden-reference answers or rationales. We therefore present \textsc{PuzzleBen}, a weakly supervised benchmark comprising 25,147 complex questions, answers, and human-generated rationales across diverse domains, such as brainteasers, puzzles, riddles, parajumbles, and critical reasoning tasks. A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore using less supervised data to boost LLMs' inference capabilities. Our experiments underscore the significance of \textsc{PuzzleBen} and the effectiveness of our methodology as a promising direction for future work. Our dataset and code will be published soon at \texttt{Anonymity Link}.
https://arxiv.org/abs/2405.04086
The accuracy and robustness of 3D human pose estimation (HPE) are limited by 2D pose detection errors and the ill-posed nature of 2D-to-3D lifting, which has drawn great attention to multi-hypothesis HPE (MH-HPE) research. Most existing MH-HPE methods are based on generative models, which are computationally expensive and difficult to train. In this study, we propose a Probabilistic Restoration 3D Human Pose Estimation framework (PRPose) that can be integrated with any lightweight single-hypothesis model. Specifically, PRPose employs a weakly supervised approach to fit the hidden probability distribution of the 2D-to-3D lifting process in a single-hypothesis HPE model, and then reverse-maps that distribution to the 2D pose input through an adaptive noise sampling strategy to generate reasonable multi-hypothesis samples effectively. Extensive experiments on the 3D HPE benchmarks Human3.6M and MPI-INF-3DHP highlight the effectiveness and efficiency of PRPose. Code is available at: this https URL.
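The sampling strategy can be sketched as follows: perturb the 2D input with per-joint Gaussian noise scaled by an estimated uncertainty, then run the single-hypothesis lifter repeatedly. How PRPose fits those per-joint scales is the weakly supervised part of the method and is not reproduced here; names are ours.

```python
import torch

def multi_hypothesis_lift(lifter, pose_2d, joint_sigma, n_samples=20):
    """Generate multi-hypothesis 3D poses from a single-hypothesis lifter
    by perturbing the 2D input with per-joint adaptive Gaussian noise.

    pose_2d: (J, 2) detected 2D joints; joint_sigma: (J, 1) noise scales,
    assumed to come from a fitted uncertainty model as described above.
    """
    hyps = []
    for _ in range(n_samples):
        noisy = pose_2d + torch.randn_like(pose_2d) * joint_sigma
        hyps.append(lifter(noisy))        # one (J, 3) hypothesis per sample
    return torch.stack(hyps)              # (n_samples, J, 3)
```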
https://arxiv.org/abs/2405.02114
Minimizing the need for pixel-level annotated data to train PET anomaly segmentation networks is crucial, particularly given the time and cost of expert annotation. Current un-/weakly supervised anomaly detection methods rely on autoencoders or generative adversarial networks trained only on healthy data, which are more challenging to train. In this work, we present a weakly supervised and Implicitly guided COuNterfactual diffusion model for Detecting Anomalies in PET images, branded IgCONDA-PET. Training is conditioned on image class labels (healthy vs. unhealthy), with implicit guidance used to generate counterfactuals for an unhealthy image with anomalies. The counterfactual generation process synthesizes the healthy counterpart of a given unhealthy image, and the difference between the two facilitates the identification of anomaly locations. The code is available at: this https URL
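The detection step reduces to a difference map between the image and its synthesized healthy counterpart, as in this sketch; generate_healthy stands in for the implicitly guided conditional diffusion sampler, and the threshold is an assumed post-processing choice.

```python
import torch

def anomaly_map_from_counterfactual(image, generate_healthy, threshold=0.5):
    """Localize anomalies as the normalized absolute difference between an
    unhealthy PET image and its healthy counterfactual.

    generate_healthy is a placeholder for the class-conditional diffusion
    sampler described above.
    """
    healthy = generate_healthy(image)           # counterfactual synthesis
    diff = (image - healthy).abs()
    diff = diff / diff.amax().clamp(min=1e-8)   # normalize to [0, 1]
    return diff, diff > threshold               # heatmap and binary mask
```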
https://arxiv.org/abs/2405.00239
Given the emergence of deep learning, digital pathology has gained popularity for cancer diagnosis based on histology images. Deep weakly supervised object localization (WSOL) models can be trained to classify histology images by cancer grade and to identify regions of interest (ROIs) for interpretation, using inexpensive global image-class annotations. A WSOL model initially trained on labeled source images can then be adapted with unlabeled target data when significant domain shifts arise from variations in staining, scanners, and cancer type. In this paper, we focus on source-free (unsupervised) domain adaptation (SFDA), a challenging setting in which a pre-trained source model is adapted to a new target domain without using any source domain data, for privacy and efficiency reasons. SFDA of WSOL models raises several challenges in histology, most notably because these methods are not designed to adapt for both classification and localization tasks. In this paper, four state-of-the-art SFDA methods, each representative of a main SFDA family, are compared for WSOL in terms of classification and localization accuracy: SFDA-Distribution Estimation, Source HypOthesis Transfer, Cross-Domain Contrastive Learning, and Adaptively Domain Statistics Alignment. Experimental results on the challenging GlaS (smaller, colon cancer) and Camelyon16 (larger, breast cancer) histology datasets indicate that these SFDA methods typically perform poorly at localization after adaptation when optimized for classification.
https://arxiv.org/abs/2404.19113
The MEDIQA-M3G 2024 challenge necessitates novel solutions for Multilingual & Multimodal Medical Answer Generation in dermatology (Yim et al., 2024a). This paper addresses the limitations of traditional methods by proposing a weakly supervised learning approach for open-ended medical question answering (QA). Our system leverages readily available MEDIQA-M3G images via a VGG16-CNN-SVM model, enabling multilingual (English, Chinese, Spanish) learning of informative skin condition representations. Using pre-trained QA models, we further bridge the gap between visual and textual information through multimodal fusion. This approach tackles complex, open-ended questions even without predefined answer choices. We enable the generation of comprehensive answers by feeding the ViT-CLIP model with multiple responses alongside the images. This work advances medical QA research, paving the way for clinical decision support systems and ultimately improving healthcare delivery.
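A hedged sketch of the feature-extraction-plus-SVM stage is shown below, using an off-the-shelf VGG16 backbone and an RBF SVM; the preprocessing, label set, and how this stage is wired into the multimodal QA fusion are assumptions on our part.

```python
import torch
import torchvision.models as models
from sklearn.svm import SVC

# A sketch of a VGG16-feature + SVM classifier, assuming images already
# preprocessed to (B, 3, 224, 224) tensors; variable names illustrative.
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
vgg.classifier = vgg.classifier[:-1]   # drop final layer -> 4096-d features
vgg.eval()

@torch.no_grad()
def extract_features(images):
    return vgg(images).numpy()          # (B, 4096) CNN features

svm = SVC(kernel="rbf", probability=True)
# svm.fit(extract_features(train_images), train_condition_labels)
# probs = svm.predict_proba(extract_features(test_images))
```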
https://arxiv.org/abs/2405.01583
Slippery road weather conditions are prevalent in many regions and pose a recurring risk to traffic. Still, there has been little research on how autonomous vehicles could detect slippery driving conditions ahead in order to drive safely. In this work, we propose a method to predict a dense grip map of the area in front of the car from postprocessed multimodal sensor data. We trained a convolutional neural network to predict pixelwise grip values from fused RGB camera, thermal camera, and LiDAR reflectance images, using weakly supervised ground truth from an optical road weather sensor. The experiments show that dense grip values can be predicted with good accuracy from these modalities: the produced grip map follows both the ground truth measurements and local weather conditions, such as snowy areas on the road. Models using only the RGB camera or LiDAR reflectance modality provided good baselines for grip prediction accuracy, while fusing the RGB camera, thermal camera, and LiDAR modalities improved the grip predictions significantly.
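A minimal version of this setup is sketched below: concatenate the modalities as input channels, regress a per-pixel grip value in [0, 1], and supervise only where the road weather sensor produced a reading; channel counts and network depth are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GripNet(nn.Module):
    """Tiny fully convolutional sketch: fuse RGB (3), thermal (1), and
    LiDAR reflectance (1) channels and regress per-pixel grip in [0, 1]."""
    def __init__(self, in_ch=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)               # (B, 1, H, W) dense grip map

def weak_grip_loss(pred, grip_gt, valid_mask):
    """L2 loss only where the road weather sensor provided a reading,
    reflecting the sparse, weakly supervised ground truth."""
    err = (pred - grip_gt) ** 2
    return (err * valid_mask).sum() / valid_mask.sum().clamp(min=1)
```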
https://arxiv.org/abs/2404.17324
We propose a method to remotely verify the authenticity of Optically Variable Devices (OVDs), often referred to as ``holograms'', in identity documents. Our method processes video clips captured with smartphones under common lighting conditions and is evaluated on two public datasets: MIDV-HOLO and MIDV-2020. Thanks to weakly supervised training, we optimize a feature extraction and decision pipeline that achieves new leading performance on MIDV-HOLO while maintaining high recall on documents from MIDV-2020 used as attack samples. It is also the first method to date to effectively address the photo replacement attack, and it can be trained on genuine samples, attack samples, or both for increased performance. By enabling OVD shapes and dynamics to be verified with very little supervision, this work opens the way toward using massive amounts of unlabeled data to build robust remote identity document verification systems on commodity smartphones. Code is available at this https URL
https://arxiv.org/abs/2404.17253
Weakly supervised medical image segmentation (MIS) using generative models is crucial for clinical diagnosis. However, the accuracy of segmentation results is often limited by insufficient supervision and the complex nature of medical imaging. Existing models also provide only a single outcome, which does not allow uncertainty to be measured. In this paper, we introduce DiffSeg, a segmentation model for skin lesions based on diffusion difference, which exploits diffusion model principles to extract noise-based features from images with diverse semantic information. By discerning the differences between these noise features, the model identifies diseased areas. Moreover, its multi-output capability mimics doctors' annotation behavior, facilitating the visualization of consistency and ambiguity in the segmentation results. Additionally, it quantifies output uncertainty using the Generalized Energy Distance (GED), aiding interpretability and decision-making for physicians. Finally, the model integrates its outputs through the Dense Conditional Random Field (DenseCRF) algorithm, which refines the segmentation boundaries by considering inter-pixel correlations, improving accuracy and optimizing the segmentation results. We demonstrate the effectiveness of DiffSeg on the ISIC 2018 Challenge dataset, outperforming state-of-the-art U-Net-based methods.
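GED is a standard multi-sample segmentation uncertainty metric; with distance d = 1 - IoU it can be computed as below for a set of model samples and reference annotations. This sketch covers the metric only, not DiffSeg's diffusion-difference sampling, and function names are ours.

```python
import torch

def iou_distance(a, b, eps=1e-8):
    """d(a, b) = 1 - IoU for two boolean (H, W) masks."""
    inter = (a & b).float().sum()
    union = (a | b).float().sum()
    return 1.0 - inter / (union + eps)

def generalized_energy_distance(samples, references):
    """Squared GED between model segmentation samples and reference
    annotations: 2 E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')].

    samples, references: lists of boolean (H, W) tensors.
    """
    def mean_d(xs, ys):
        return torch.stack([iou_distance(x, y) for x in xs for y in ys]).mean()
    return (2 * mean_d(samples, references)
            - mean_d(samples, samples)
            - mean_d(references, references))
```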
https://arxiv.org/abs/2404.16474
Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in the contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training method for vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in a contrastive loss, achieving a remarkable $2.7\times$ acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality. Our source code, along with pre-trained model weights and training recipes, is available at \url{this https URL}.
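The reframing can be sketched as multi-label classification over a fixed vocabulary of concepts extracted from captions, trained with binary cross-entropy so that no pairwise image-text similarity matrix is needed; the vocabulary, feature dimension, and extraction step below are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

def caption_to_targets(nouns, vocab_index, vocab_size):
    """Multi-hot target over a fixed noun vocabulary extracted from a
    caption, replacing the text encoder and pairwise contrastive loss."""
    t = torch.zeros(vocab_size)
    for n in nouns:
        if n in vocab_index:
            t[vocab_index[n]] = 1.0
    return t

# Illustrative vocabulary and head; assumes image_encoder gives (B, 768).
vocab = {"dog": 0, "ball": 1, "grass": 2}
head = nn.Linear(768, len(vocab))
criterion = nn.BCEWithLogitsLoss()

def training_step(image_encoder, images, batch_nouns):
    feats = image_encoder(images)                           # (B, 768)
    targets = torch.stack(
        [caption_to_targets(ns, vocab, len(vocab)) for ns in batch_nouns])
    return criterion(head(feats), targets)  # no pairwise similarity matrix
```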
https://arxiv.org/abs/2404.15653
Weakly supervised segmentation methods have gained significant attention for their ability to reduce reliance on costly pixel-level annotations during model training. However, current weakly supervised nuclei segmentation approaches typically follow a two-stage process of pseudo-label generation and network training. The performance of nuclei segmentation heavily relies on the quality of the generated pseudo-labels, which limits its effectiveness. This paper introduces a novel domain-adaptive weakly supervised nuclei segmentation framework that uses cross-task interaction strategies to overcome the challenge of pseudo-label generation. Specifically, we use weakly annotated data to train an auxiliary detection task, which assists the domain adaptation of the segmentation network. To enhance the efficiency of domain adaptation, we design a consistent feature constraint module integrating prior knowledge from the source domain. Furthermore, we develop pseudo-label optimization and interactive training methods to improve the domain transfer capability. To validate the effectiveness of the proposed method, we conduct extensive comparative and ablation experiments on six datasets. The results demonstrate the superiority of our approach over existing weakly supervised approaches. Remarkably, our method achieves comparable or even better performance than fully supervised methods. Our code will be released at this https URL.
https://arxiv.org/abs/2404.14956
Current point cloud semantic segmentation has achieved great advances when sufficient labels are available. However, the dense annotation of LiDAR point clouds remains prohibitively expensive and time-consuming, unable to keep up with the continuously growing volume of data. In this paper, we propose annotating images with scattered points and then utilizing SAM (a foundation model) to generate semantic segmentation labels for the images. Finally, by mapping these image segmentation labels into the LiDAR space using the intrinsic and extrinsic parameters of the camera and LiDAR, we obtain labels for point cloud semantic segmentation, and we release Scatter-KITTI and Scatter-nuScenes, the first works to utilize image-segmentation-based SAM for weakly supervised point cloud semantic segmentation. Furthermore, to mitigate the influence of erroneous pseudo-labels obtained from sparse annotations on point cloud features, we propose a multi-modal weakly supervised network for LiDAR semantic segmentation, called MM-ScatterNet. This network combines features from the point cloud and image modalities, enhancing the representation learning of point clouds by introducing consistency constraints between multi-modal features and point cloud features. On the SemanticKITTI dataset, we achieve 66% of fully supervised performance using only 0.02% of the annotated data, and on the nuScenes dataset, we achieve 95% of fully supervised performance using only 0.1% of labeled points.
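The labeling pipeline can be sketched with the public segment-anything API plus a standard pinhole projection, as below; the checkpoint path, calibration matrices K (3x3 intrinsics) and T (4x4 LiDAR-to-camera extrinsics), and click format are assumptions.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Assumes a downloaded SAM ViT-B checkpoint at an illustrative path.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def label_points(image, clicks, click_labels, lidar_xyz, K, T):
    """Turn scattered click annotations into LiDAR point labels: prompt SAM
    with the clicks, project LiDAR points into the image, and read the
    predicted mask at each projected pixel."""
    predictor.set_image(image)                   # HxWx3 uint8 RGB
    masks, _, _ = predictor.predict(
        point_coords=np.asarray(clicks, dtype=np.float32),      # (P, 2) xy
        point_labels=np.asarray(click_labels, dtype=np.int32),  # 1=fg, 0=bg
        multimask_output=False)
    mask = masks[0]                              # (H, W) bool

    pts_h = np.c_[lidar_xyz, np.ones(len(lidar_xyz))]  # homogeneous coords
    cam = (T @ pts_h.T).T[:, :3]                 # LiDAR -> camera frame
    in_front = cam[:, 2] > 0                     # keep points ahead of camera
    uv = (K @ cam[in_front].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)    # perspective divide
    h, w = mask.shape
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    point_labels = np.zeros(len(lidar_xyz), dtype=bool)
    idx = np.flatnonzero(in_front)[valid]
    point_labels[idx] = mask[uv[valid, 1], uv[valid, 0]]
    return point_labels
```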
https://arxiv.org/abs/2404.12861
Deep learning is dramatically transforming the field of medical imaging and radiology, enabling the identification of pathologies in medical images, including computed tomography (CT) and X-ray scans. However, the performance of deep learning models, particularly in segmentation tasks, is often limited by the need for extensive annotated datasets. To address this challenge, we explore the capabilities of weakly supervised semantic segmentation through the lens of Explainable AI and the generation of counterfactual explanations. The scope of this research is the development of a novel counterfactual inpainting approach (COIN) that flips the predicted classification label from abnormal to normal using a generative model. For instance, if a classifier deems an input medical image X abnormal, indicating the presence of a pathology, the generative model aims to inpaint the abnormal region, thus reversing the classifier's original prediction. This approach enables us to produce precise segmentations of pathologies without depending on pre-existing segmentation masks. Crucially, it uses image-level labels, which are substantially easier to acquire than detailed segmentation masks. The effectiveness of the method is demonstrated by segmenting synthetic targets and actual kidney tumors in CT images acquired from Tartu University Hospital in Estonia. The findings indicate that COIN greatly surpasses established attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an alternative counterfactual explanation method introduced by Singla et al. This suggests that COIN is a promising approach for the semantic segmentation of tumors in CT images, and a step forward in making deep learning applications more accessible and effective in healthcare, where annotated data is scarce.
https://arxiv.org/abs/2404.12832