Content moderation faces a challenging task, as social media's ability to spread hate speech contrasts with its role in promoting global connectivity. With rapidly evolving slang and hate speech, the adaptability of conventional deep learning to the fluid landscape of online dialogue remains limited. In response, causality-inspired disentanglement has shown promise by segregating platform-specific peculiarities from universal hate indicators. However, its dependency on available ground-truth target labels for discerning these nuances faces practical hurdles given the incessant evolution of platforms and the mutable nature of hate speech. Using confidence-based reweighting and contrastive regularization, this study presents HATE WATCH, a novel framework for weakly supervised causal disentanglement that circumvents the need for explicit target labeling and effectively disentangles input features into invariant representations of hate. Empirical validation across four platforms (two with target labels and two without) positions HATE WATCH as a superior method for cross-platform hate speech detection. HATE WATCH advances scalable content moderation techniques towards developing safer online communities.
https://arxiv.org/abs/2404.11036
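A minimal sketch of the two ingredients named above, under our own assumptions (the function names and weighting scheme are illustrative, not the authors' implementation): a cross-entropy term reweighted by each example's pseudo-label confidence, and a supervised contrastive regularizer that treats posts with the same hate label as positives regardless of platform.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_ce(logits, pseudo_labels):
    """Cross-entropy where each example is down-weighted by the softmax
    confidence of its own pseudo-label, so noisy weak labels contribute less."""
    probs = F.softmax(logits, dim=-1)
    conf = probs.gather(1, pseudo_labels.unsqueeze(1)).squeeze(1).detach()
    ce = F.cross_entropy(logits, pseudo_labels, reduction="none")
    return (conf * ce).mean()

def contrastive_regularizer(z, labels, temperature=0.1):
    """Supervised contrastive term over a batch of post embeddings z: (B, D)."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / temperature
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    denom = torch.logsumexp(sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    log_prob = sim - denom                                   # log p(j | i), self excluded
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    return -((log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)).mean()
```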
Featurizing microscopy images for use in biological research remains a significant challenge, especially for large-scale experiments spanning millions of images. This work explores the scaling properties of weakly supervised classifiers and self-supervised masked autoencoders (MAEs) when trained with increasingly larger model backbones and microscopy datasets. Our results show that ViT-based MAEs outperform weakly supervised classifiers on a variety of tasks, achieving as much as an 11.5% relative improvement when recalling known biological relationships curated from public databases. Additionally, we develop a new channel-agnostic MAE architecture (CA-MAE) that allows for inputting images with different numbers and orders of channels at inference time. We demonstrate that CA-MAEs generalize effectively by inferring and evaluating on a microscopy image dataset (JUMP-CP) generated under different experimental conditions and with a different channel structure than our pretraining data (RPI-93M). Our findings motivate continued research into scaling self-supervised learning on microscopy data in order to create powerful foundation models of cellular biology that have the potential to catalyze advancements in drug discovery and beyond.
https://arxiv.org/abs/2404.10242
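To make the channel-agnostic idea concrete, here is a hedged sketch of a patch-embedding layer that accepts any number and order of channels by applying one shared single-channel projection per channel plus a learned channel embedding; the shapes and wiring are our assumptions, not the paper's exact CA-MAE architecture.

```python
import torch
import torch.nn as nn

class ChannelAgnosticPatchEmbed(nn.Module):
    def __init__(self, patch_size=16, embed_dim=768, max_channels=16):
        super().__init__()
        # One shared projection applied per channel (in_chans=1).
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learned channel embeddings let the model tell channels apart when ids are known.
        self.channel_embed = nn.Embedding(max_channels, embed_dim)

    def forward(self, x, channel_ids=None):
        b, c, h, w = x.shape
        if channel_ids is None:
            channel_ids = torch.arange(c, device=x.device)
        tokens = []
        for i in range(c):
            t = self.proj(x[:, i : i + 1])            # (B, D, H/ps, W/ps)
            t = t.flatten(2).transpose(1, 2)          # (B, N, D)
            tokens.append(t + self.channel_embed(channel_ids[i]))
        return torch.cat(tokens, dim=1)               # (B, C*N, D) tokens for the ViT

x = torch.randn(2, 5, 224, 224)                       # e.g. a 5-channel microscopy crop
print(ChannelAgnosticPatchEmbed()(x).shape)           # torch.Size([2, 980, 768])
```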
Weakly Supervised Object Localization (WSOL) allows for training deep learning models for classification and localization using only global, class-level labels. The lack of bounding box (bbox) supervision during training represents a considerable challenge for hyper-parameter search and model selection. Earlier WSOL works implicitly observed localization performance over the test set, which leads to biased performance evaluation. More recently, a better WSOL protocol has been proposed, where a validation set with bbox annotations is held out for model selection. Although it does not rely on the test set, this protocol is unrealistic since bboxes are not available in real-world applications, and when they are available, it is better to use them directly to fit model weights. Our initial empirical analysis shows that the localization performance of a model declines significantly when using only image-class labels for model selection (compared to using bounding-box annotations). This suggests that adding bounding-box labels is preferable for selecting the best model for localization. In this paper, we introduce a new WSOL validation protocol that provides a localization signal without the need for manual bbox annotations. In particular, we leverage noisy pseudo boxes from off-the-shelf ROI proposal generators such as Selective-Search, CLIP, and pretrained RPN models for model selection. Our experimental results with several WSOL methods on the ILSVRC and CUB-200-2011 datasets show that our noisy boxes allow selecting models with performance close to those selected using ground-truth boxes, and better than models selected using only image-class labels.
https://arxiv.org/abs/2404.10034
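A schematic sketch of how such a validation protocol could work (our reading, not the authors' code): candidate checkpoints are ranked by how often their predicted box overlaps one of the noisy pseudo boxes, in place of ground-truth annotations. `predict_box` and the data structures here are hypothetical placeholders.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def select_checkpoint(checkpoints, val_images, pseudo_boxes, predict_box, thresh=0.5):
    """pseudo_boxes: {image_id: [noisy boxes]} from Selective-Search / CLIP / an RPN.
    predict_box(ckpt, img) -> one predicted box. val_images: [(image_id, image), ...].
    Returns the checkpoint with the best pseudo-box localization accuracy."""
    def score(ckpt):
        hits = sum(
            any(iou(predict_box(ckpt, img), pb) >= thresh for pb in pseudo_boxes[img_id])
            for img_id, img in val_images
        )
        return hits / len(val_images)
    return max(checkpoints, key=score)
```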
Existing deep trackers are typically trained on large-scale video frames with annotated bounding boxes. However, these bounding boxes are expensive and time-consuming to annotate, particularly for large-scale datasets. In this paper, we propose to learn tracking representations from single-point annotations (i.e., 4.5x faster to annotate than traditional bounding boxes) in a weakly supervised manner. Specifically, we propose a soft contrastive learning (SoCL) framework that incorporates a target objectness prior into end-to-end contrastive learning. Our SoCL consists of adaptive positive and negative sample generation, which is memory-efficient and effective for learning tracking representations. We apply the learned representations of SoCL to visual tracking and show that our method can 1) achieve better performance than the fully supervised baseline trained with box annotations under the same annotation time cost; 2) achieve performance comparable to the fully supervised baseline using the same number of training frames while reducing annotation time cost by 78% and total fees by 85%; and 3) be robust to annotation noise.
https://arxiv.org/abs/2404.09504
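A minimal sketch, under our own assumptions, of what a "soft" contrastive objective with an objectness prior might look like: crops near the annotated point act as soft positives, weighted by a Gaussian prior on their distance to the point. The weighting form and temperature are illustrative, not the paper's SoCL.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(anchor, crops, dists, temperature=0.2, sigma=32.0):
    """anchor: (D,) embedding of the point-centered crop.
    crops: (N, D) embeddings of candidate crops.
    dists: (N,) pixel distance of each crop center to the annotated point,
    used here as a simple stand-in for the target objectness prior."""
    w = torch.exp(-dists.pow(2) / (2 * sigma**2))   # soft positive weights in [0, 1]
    anchor = F.normalize(anchor, dim=-1)
    crops = F.normalize(crops, dim=-1)
    sim = crops @ anchor / temperature              # (N,) similarity to the anchor
    log_p = sim - torch.logsumexp(sim, dim=0)       # contrast each crop against all
    return -(w * log_p).sum() / w.sum().clamp(min=1e-6)
```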
Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels from weak labels and then self-training a classifier is currently a promising solution. However, existing methods use only the RGB visual modality and neglect category text information, limiting the generation of more accurate pseudo-labels and hurting the performance of self-training. Inspired by the manual labeling process based on event descriptions, in this paper we propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model to align video event description text with the corresponding video frames and thereby generate pseudo-labels. Specifically, we first fine-tune CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. Further, we propose a learnable text prompt mechanism, assisted by a normality visual prompt, to further improve the matching accuracy between video event description text and video frames. Then, we design a pseudo-label generation module based on the normality guidance to infer reliable frame-level pseudo-labels. Finally, we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Violence.
https://arxiv.org/abs/2404.08531
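As an illustration of the core matching step only (not TPWNG itself, which adds fine-tuning, learnable prompts, and normality guidance), the following sketch scores video frames against class-description prompts with an off-the-shelf CLIP and thresholds the non-normal probability mass into frame-level pseudo-labels; the prompts and threshold are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
prompts = ["a normal everyday scene", "a fighting event", "a road accident"]

def frame_pseudo_labels(frames, threshold=0.5):
    """frames: list of PIL images sampled from one video."""
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)  # (T, num_prompts)
    anomaly_score = 1.0 - probs[:, 0]   # probability mass not on the "normal" prompt
    return (anomaly_score > threshold).long(), anomaly_score
```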
Weakly supervised semantic segmentation (WSSS) with image-level labels aims to achieve dense prediction tasks without laborious annotations. However, due to ambiguous contexts and fuzzy regions, the performance of WSSS, especially at the stages of generating Class Activation Maps (CAMs) and refining pseudo masks, widely suffers from ambiguity, an issue barely noticed in previous literature. In this work, we propose UniA, a unified single-staged WSSS framework, to efficiently tackle this issue from the perspectives of uncertainty inference and affinity diversification, respectively. When activating class objects, we argue that false activations stem from bias toward ambiguous regions during feature extraction. Therefore, we design a more robust feature representation with a probabilistic Gaussian distribution and introduce uncertainty estimation to avoid this bias. A distribution loss is specifically proposed to supervise the process, which effectively captures the ambiguity and models the complex dependencies among features. When refining pseudo labels, we observe that the affinity from prevailing refinement methods tends to be similar among ambiguous regions. To this end, an affinity diversification module is proposed to promote diversity among semantics. A mutually complementing refinement first rectifies the ambiguous affinity with multiple inferred pseudo labels. More importantly, a contrastive affinity loss is further designed to diversify the relations among unrelated semantics, which reliably propagates the diversity into the whole feature representation and helps generate better pseudo masks. Extensive experiments are conducted on the PASCAL VOC, MS COCO, and medical ACDC datasets, validating the efficiency of UniA in tackling ambiguity and its superiority over recent single-staged and even most multi-staged competitors.
https://arxiv.org/abs/2404.08195
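A sketch of the probabilistic feature idea described above, under our own assumptions: features are modeled as a diagonal Gaussian and sampled with the reparameterization trick, with a KL term as a simple stand-in for the paper's distribution loss; the predicted variance doubles as a per-pixel uncertainty map.

```python
import torch
import torch.nn as nn

class GaussianFeature(nn.Module):
    """Represents each spatial feature as N(mu, var) instead of a point estimate."""

    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Conv2d(dim, dim, 1)
        self.logvar = nn.Conv2d(dim, dim, 1)

    def forward(self, feats):
        mu, logvar = self.mu(feats), self.logvar(feats)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)                 # reparameterized sample
        # KL(N(mu, var) || N(0, I)), a simple stand-in for the distribution loss.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).mean()
        uncertainty = logvar.exp().mean(dim=1)               # per-pixel uncertainty map
        return z, kl, uncertainty
```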
This paper introduces MAD-MIL, a Multi-head Attention-based Deep Multiple Instance Learning model designed for weakly supervised Whole Slide Image (WSI) classification in digital pathology. Inspired by the multi-head attention mechanism of the Transformer, MAD-MIL reduces model complexity while achieving competitive results against advanced models like CLAM and DS-MIL. Evaluated on MNIST-BAGS and public datasets including TUPAC16, TCGA BRCA, TCGA LUNG, and TCGA KIDNEY, MAD-MIL consistently outperforms ABMIL, demonstrating enhanced information diversity, interpretability, and efficiency in slide representation. The model's effectiveness, coupled with fewer trainable parameters and lower computational complexity, makes it a promising solution for automated pathology workflows. Our code is available at this https URL.
https://arxiv.org/abs/2404.05362
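A compact sketch of multi-head attention MIL pooling in the spirit of MAD-MIL (the head count, hidden sizes, and chunk-per-head split are illustrative assumptions): each head attends over the tiles of one slide and the head-wise bag embeddings are concatenated for slide-level classification.

```python
import torch
import torch.nn as nn

class MultiHeadAttnMIL(nn.Module):
    def __init__(self, in_dim=1024, heads=4, n_classes=2):
        super().__init__()
        self.head_dim = in_dim // heads
        self.attn = nn.ModuleList(
            nn.Sequential(nn.Linear(self.head_dim, 128), nn.Tanh(), nn.Linear(128, 1))
            for _ in range(heads)
        )
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, tiles):                          # tiles: (N, in_dim), one WSI bag
        chunks = tiles.split(self.head_dim, dim=1)     # one feature chunk per head
        pooled = []
        for h, x in enumerate(chunks):
            a = torch.softmax(self.attn[h](x), dim=0)  # (N, 1) attention over tiles
            pooled.append((a * x).sum(dim=0))          # head-wise bag embedding
        return self.classifier(torch.cat(pooled))      # slide-level logits

logits = MultiHeadAttnMIL()(torch.randn(500, 1024))    # a bag of 500 tile features
```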
In hematology, computational models offer significant potential to improve diagnostic accuracy, streamline workflows, and reduce the tedious work of analyzing single cells in peripheral blood or bone marrow smears. However, clinical adoption of computational models has been hampered by a lack of generalization due to large batch effects, small dataset sizes, and poor performance in transfer learning from natural images. To address these challenges, we introduce DinoBloom, the first foundation model for single-cell images in hematology, utilizing a tailored DINOv2 pipeline. Our model is built upon an extensive collection of 13 diverse, publicly available datasets of peripheral blood and bone marrow smears, the most substantial open-source cohort in hematology so far, comprising over 380,000 white blood cell images. To assess its generalization capability, we evaluate it on an external dataset with a challenging domain shift. We show that our model outperforms existing medical and non-medical vision models in (i) linear probing and k-nearest-neighbor evaluations for cell-type classification on blood and bone marrow smears and (ii) weakly supervised multiple instance learning for acute myeloid leukemia subtyping, by a large margin. A family of four DinoBloom models (small, base, large, and giant) can be adapted for a wide range of downstream applications, serve as a strong baseline for classification problems, and facilitate the assessment of batch effects in new datasets. All models are available at this http URL.
https://arxiv.org/abs/2404.05022
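The two evaluation protocols named in (i) are standard; a minimal sketch, assuming precomputed DinoBloom embeddings, might look as follows (the split, `k`, and metric choices are illustrative).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def probe_and_knn(train_emb, train_y, test_emb, test_y, k=20):
    """Linear probing and kNN cell-type classification on frozen embeddings."""
    linear = LogisticRegression(max_iter=1000).fit(train_emb, train_y)
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine").fit(train_emb, train_y)
    return {
        "linear_f1": f1_score(test_y, linear.predict(test_emb), average="macro"),
        "knn_f1": f1_score(test_y, knn.predict(test_emb), average="macro"),
    }
```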
Deep quantization methods have shown high efficiency for large-scale image retrieval. However, current models heavily rely on ground-truth information, hindering the application of quantization in label-hungry scenarios. A more realistic demand is to learn from the inexhaustible stream of uploaded images that are associated with informal tags provided by amateur users. Though such sketchy tags do not obviously reveal the labels, they actually contain useful semantic information for supervising deep quantization. To this end, we propose Weakly-Supervised Deep Hyperspherical Quantization (WSDHQ), which is the first work to learn deep quantization from weakly tagged images. Specifically, 1) we use word embeddings to represent the tags and enhance their semantic information based on a tag correlation graph; 2) to better preserve semantic information in the quantization codes and reduce quantization error, we jointly learn semantics-preserving embeddings and a supervised quantizer on the hypersphere, employing a well-designed fusion layer and tailor-made loss functions. Extensive experiments show that WSDHQ achieves state-of-the-art performance in weakly-supervised compact coding. Code is available at this https URL.
https://arxiv.org/abs/2404.04998
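Step 1) can be illustrated with a small sketch (our reading; the graph construction via a cosine threshold is an assumption): tag word-embeddings are mixed with their neighbors on a tag-correlation graph, so sparse, noisy tags borrow semantics from correlated ones.

```python
import numpy as np

def enhance_tag_embeddings(E, tau=0.5, alpha=0.5):
    """E: (T, d) word embeddings of the tag vocabulary.
    Returns embeddings smoothed over a cosine-similarity tag graph."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    A = En @ En.T                            # cosine similarity between tags
    A = np.where(A >= tau, A, 0.0)           # keep only strongly correlated pairs
    A = A / A.sum(axis=1, keepdims=True)     # row-normalize (diagonal keeps sums > 0)
    return alpha * E + (1 - alpha) * A @ E   # mix each tag with its graph neighbors
```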
The diagnosis of primary liver cancers (PLCs) can be challenging, especially on biopsies and for combined hepatocellular-cholangiocarcinoma (cHCC-CCA). We automatically classified PLCs on routine-stained biopsies using a weakly supervised learning method. Weak tumour/non-tumour annotations served as labels for training a ResNet18 neural network, and the network's last convolutional layer was used to extract new tumour tile features. Without knowledge of the precise labels of the malignancies, we then applied an unsupervised clustering algorithm. Our model identified specific features of hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (iCCA). Although no specific features of cHCC-CCA were recognized, the identification of HCC and iCCA tiles within a slide could facilitate the diagnosis of primary liver cancers, particularly cHCC-CCA. Method and results: 166 PLC biopsies were divided into training, internal validation, and external validation sets (90, 29, and 47 samples, respectively). Two liver pathologists reviewed each whole-slide hematein eosin saffron (HES)-stained image (WSI). After annotating the tumour/non-tumour areas, 256x256-pixel tiles were extracted from the WSIs and used to train a ResNet18. The network was then used to extract new tile features, to which an unsupervised clustering algorithm was applied. In a two-cluster model, Clusters 0 and 1 contained mainly HCC and iCCA histological features, respectively. The diagnostic agreement between the pathological diagnosis and the model predictions in the internal and external validation sets was 100% (11/11) and 96% (25/26) for HCC, and 78% (7/9) and 87% (13/15) for iCCA, respectively. For cHCC-CCA, we observed a highly variable proportion of tiles from each cluster (Cluster 0: 5-97%; Cluster 1: 2-94%).
https://arxiv.org/abs/2404.04983
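A schematic sketch of the described pipeline, with hypothetical variable names: tile features are taken from the last convolutional block of the tumour/non-tumour ResNet18 and clustered into two groups without any malignancy labels, matching the two-cluster model above.

```python
import torch
import torchvision.models as models
from sklearn.cluster import KMeans

resnet = models.resnet18(weights=None)  # assume the fine-tuned tumour/non-tumour weights are loaded here
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()  # up to global pooling

def cluster_tumour_tiles(tiles):
    """tiles: (N, 3, 256, 256) tensor of tumour tiles extracted from the WSIs."""
    with torch.no_grad():
        feats = feature_extractor(tiles).flatten(1).numpy()     # (N, 512) tile features
    return KMeans(n_clusters=2, n_init=10).fit_predict(feats)   # Cluster 0 / Cluster 1 ids
```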
In weakly-supervised semantic segmentation (WSSS) using only image-level class labels, a problem with CNN-based Class Activation Maps (CAM) is that they tend to activate the most discriminative local regions of objects. On the other hand, methods based on Transformers learn global features but suffer from the issue of background noise contamination. This paper focuses on addressing the issue of background noise in attention weights within the existing WSSS method based on Conformer, known as TransCAM. The proposed method successfully reduces background noise, leading to improved accuracy of pseudo labels. Experimental results demonstrate that our model achieves segmentation performance of 70.5% on the PASCAL VOC 2012 validation data, 71.1% on the test data, and 45.9% on MS COCO 2014 data, outperforming TransCAM in terms of segmentation performance.
https://arxiv.org/abs/2404.03394
Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object, particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance, which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization. To address this limitation, in this paper, we propose incorporating the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures. Our framework, dubbed T-VSL, begins by predicting the class of sounding entities in mixtures. Subsequently, the textual representation of each sounding source is employed as guidance to disentangle fine-grained audio-visual source correspondence from multi-source mixtures, leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes during test time. Extensive experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods.
https://arxiv.org/abs/2404.01751
Whole Slide Images (WSI), obtained by high-resolution digital scanning of microscope slides at multiple scales, are the cornerstone of modern Digital Pathology. However, they represent a particular challenge to AI-based/AI-mediated analysis because pathology labeling is typically done at slide level, not tile level. It is not just that medical diagnostics are recorded at the specimen level; the detection of oncogene mutations is also experimentally obtained, and recorded by initiatives like The Cancer Genome Atlas (TCGA), at the slide level. This configures a dual challenge: a) accurately predicting the overall cancer phenotype and b) finding out which cellular morphologies are associated with it at the tile level. To address these challenges, a weakly supervised Multiple Instance Learning (MIL) approach was explored for two prevalent cancer types, Invasive Breast Carcinoma (TCGA-BRCA) and Lung Squamous Cell Carcinoma (TCGA-LUSC), for tumor detection at low magnification levels and for TP53 mutations at various levels. Our results show that a novel additive implementation of MIL matched the performance of the reference implementation (AUC 0.96) and was only slightly outperformed by Attention MIL (AUC 0.97). More interestingly from the perspective of the molecular pathologist, these different AI architectures identify distinct sensitivities to morphological features (through the detection of Regions of Interest, RoI) at different amplification levels. Tellingly, TP53 mutation was most sensitive to features at the higher amplification levels, where cellular morphology is resolved.
https://arxiv.org/abs/2404.01446
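For readers unfamiliar with the "additive" variant mentioned above, here is a hedged sketch of the general idea (our illustration, not the paper's code): the slide-level logit is an aggregate of per-tile logits, so each tile's additive contribution can be read off directly as a region-of-interest score.

```python
import torch
import torch.nn as nn

class AdditiveMIL(nn.Module):
    def __init__(self, in_dim=512, n_classes=2):
        super().__init__()
        self.tile_score = nn.Linear(in_dim, n_classes)

    def forward(self, tiles):                      # tiles: (N, in_dim) for one slide
        contributions = self.tile_score(tiles)     # (N, n_classes) per-tile logits
        slide_logits = contributions.mean(dim=0)   # aggregate to the slide level
        return slide_logits, contributions         # contributions rank RoI tiles
```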
Monocular 3D object detection poses a significant challenge in 3D scene understanding due to the inherently ill-posed nature of monocular depth estimation. Existing methods heavily rely on supervised learning using abundant 3D labels, typically obtained through expensive and labor-intensive annotation on LiDAR point clouds. To tackle this problem, we propose a novel weakly supervised 3D object detection framework named VSRD (Volumetric Silhouette Rendering for Detection) to train 3D object detectors without any 3D supervision, relying only on weak 2D supervision. VSRD consists of multi-view 3D auto-labeling and subsequent training of monocular 3D object detectors using the pseudo labels generated in the auto-labeling stage. In the auto-labeling stage, we represent the surface of each instance as a signed distance field (SDF) and render its silhouette as an instance mask through our proposed instance-aware volumetric silhouette rendering. To directly optimize the 3D bounding boxes through rendering, we decompose the SDF of each instance into the SDF of a cuboid and a residual distance field (RDF) that represents the residual from the cuboid. This mechanism enables us to optimize the 3D bounding boxes in an end-to-end manner by comparing the rendered instance masks with the ground-truth instance masks. The optimized 3D bounding boxes serve as effective training data for 3D object detection. We conduct extensive experiments on the KITTI-360 dataset, demonstrating that our method outperforms existing weakly supervised 3D object detection methods. The code is available at this https URL.
https://arxiv.org/abs/2404.00149
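The cuboid part of the SDF decomposition has a closed form; a small sketch (in the box's local, axis-aligned frame) is below. In VSRD the instance SDF would add a learned residual distance field on top of this base shape.

```python
import torch

def cuboid_sdf(points, half_extents):
    """points: (N, 3) query points in the box frame; half_extents: (3,) half-sizes.
    Returns the exact signed distance: negative inside, positive outside."""
    q = points.abs() - half_extents
    outside = q.clamp(min=0.0).norm(dim=-1)        # distance when outside the box
    inside = q.max(dim=-1).values.clamp(max=0.0)   # negative depth when inside
    return outside + inside
```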
Medical image segmentation of anatomical structures and pathology is crucial in modern clinical diagnosis, disease study, and treatment planning. To date, great progress has been made in deep learning-based segmentation techniques, but most methods still lack data efficiency, generalizability, and interactivity. Consequently, the development of new, precise segmentation methods that demand fewer labeled datasets is of utmost importance in medical image analysis. Recently, the emergence of foundation models such as CLIP and the Segment-Anything-Model (SAM), with comprehensive cross-domain representations, opened the door for interactive and universal image segmentation. However, exploration of these models for data-efficient medical image segmentation is still limited but highly necessary. In this paper, we propose a novel framework, called MedCLIP-SAM, that combines the CLIP and SAM models to segment clinical scans using text prompts, in both zero-shot and weakly supervised settings. To achieve this, we employ a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss to fine-tune the BiomedCLIP model, and use the recent gScoreCAM to generate prompts for obtaining segmentation masks from SAM in a zero-shot setting. Additionally, we explore using the zero-shot segmentation labels in a weakly supervised paradigm to further improve segmentation quality. Through extensive testing on three diverse segmentation tasks and medical image modalities (breast tumor ultrasound, brain tumor MRI, and lung X-ray), our proposed framework demonstrates excellent accuracy.
https://arxiv.org/abs/2403.20253
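The paper's DHN-NCE loss is not reproduced here, but the general recipe it names, decoupling (no positive term in the denominator) plus hard-negative weighting, can be sketched as follows under our own assumptions about the weighting.

```python
import torch
import torch.nn.functional as F

def decoupled_hard_negative_nce(img, txt, temperature=0.07, beta=1.0):
    """img, txt: (B, D) paired image/text embeddings; positives lie on the diagonal."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    sim = img @ txt.t() / temperature                      # (B, B) similarity matrix
    pos = sim.diag()                                       # positive pair scores
    diag = torch.eye(len(sim), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(diag, float("-inf"))             # negatives only
    w = torch.softmax(beta * neg, dim=1)                   # up-weight harder negatives
    denom = (w * neg.exp()).sum(dim=1) * (len(sim) - 1)    # decoupled: no positive term
    return (denom.log() - pos).mean()
```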
Purpose: Surgical video is an important data stream for gesture recognition; thus, robust visual encoders for such data streams are similarly important. Methods: Leveraging the Bridge-Prompt framework, we fine-tune a pre-trained vision-text model (CLIP) for gesture recognition in surgical videos. This approach can utilize extensive outside video data such as text, while also making use of label meta-data and weakly supervised contrastive losses. Results: Our experiments show that the prompt-based video encoder outperforms standard encoders in surgical gesture recognition tasks. Notably, it displays strong performance in zero-shot scenarios, where gestures and tasks that were not provided during the encoder training phase are included in the prediction phase. Additionally, we measure the benefit of including text descriptions in the feature-extractor training schema. Conclusion: Bridge-Prompt and similar pre-trained, fine-tuned video encoder models provide significant visual representations for surgical robotics, especially in gesture recognition tasks. Given the diverse range of surgical tasks (gestures), the ability of these models to transfer zero-shot, without any task- (gesture-) specific retraining, makes them invaluable.
https://arxiv.org/abs/2403.19786
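A minimal sketch of the zero-shot evaluation described in the results (the encoder is assumed to be fine-tuned elsewhere; the prompt template and `text_encoder` interface are hypothetical): unseen gestures are recognized by matching a clip embedding to text-prompt embeddings, CLIP-style.

```python
import torch
import torch.nn.functional as F

def zero_shot_gesture(video_emb, text_encoder, gesture_names):
    """video_emb: (D,) embedding of a surgical clip from the fine-tuned encoder.
    text_encoder: assumed to map a list of prompts to a (G, D) tensor."""
    prompts = [f"a surgeon performing {g}" for g in gesture_names]
    text_emb = F.normalize(text_encoder(prompts), dim=-1)   # (G, D)
    video = F.normalize(video_emb, dim=-1)                  # (D,)
    return gesture_names[int((text_emb @ video).argmax())]  # nearest prompt wins
```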
In recent years, research on point-based weakly supervised object detection (PWSOD) methods in the field of computer vision has attracted increasing attention. However, existing pseudo-label generation methods perform poorly with small amounts of supervised annotation data and in dense object detection tasks. We regard weakly supervised pseudo-label generation as the result of the model's sparse output, and propose a method called Sparse Generation to make pseudo labels sparse. It constructs dense tensors through the relationship between the data and the detector model, optimizes three of their parameters, and obtains a sparse tensor via coordinated calculation, thereby indirectly obtaining higher-quality pseudo labels and solving the model's density problem when only a small amount of supervised annotation data is available. On two widely used open-source datasets (RSOD, SIMD) and a self-built dataset (Bullet-Hole), the experimental results show that the proposed method has a significant advantage in overall performance metrics compared to the state-of-the-art method.
https://arxiv.org/abs/2403.19306
We propose a voting-driven semi-supervised approach to automatically acquire the typical duration of an event and use it as pseudo-labeled data. Human evaluation demonstrates that our pseudo labels exhibit surprisingly high accuracy and balanced coverage. In the temporal commonsense QA task, experimental results show that, using only pseudo examples of 400 events, we achieve performance comparable to existing BERT-based weakly supervised approaches that require a significant number of training examples. Compared to the RoBERTa baselines, our best approach establishes state-of-the-art performance with a 7% improvement in Exact Match.
https://arxiv.org/abs/2403.18504
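A toy sketch of the voting mechanism as we read it (the extractor interface and thresholds are placeholders): several weak extractors each vote for a typical duration unit, and only events with enough, sufficiently dominant votes become pseudo-labeled examples.

```python
from collections import Counter

def vote_duration(event, extractors, min_votes=3, min_ratio=0.6):
    """extractors: callables mapping an event phrase to a duration unit
    (e.g. 'seconds' ... 'years') or None when they abstain."""
    votes = [u for ex in extractors if (u := ex(event)) is not None]
    if len(votes) < min_votes:
        return None                                  # too little evidence: no pseudo-label
    unit, count = Counter(votes).most_common(1)[0]
    return unit if count / len(votes) >= min_ratio else None
```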
In digital pathology, the multiple instance learning (MIL) strategy is widely used for the weakly supervised histopathology whole slide image (WSI) classification task, where giga-pixel WSIs are only labeled at the slide level. However, existing attention-based MIL approaches often overlook contextual information and intrinsic spatial relationships between neighboring tissue tiles, while graph-based MIL frameworks have limited power to recognize long-range dependencies. In this paper, we introduce an integrative graph-transformer framework that simultaneously captures context-aware relational features and global WSI representations through a novel Graph Transformer Integration (GTI) block. Specifically, each GTI block consists of a Graph Convolutional Network (GCN) layer modeling neighboring relations at the local instance level and an efficient global attention model capturing comprehensive global information from extensive feature embeddings. Extensive experiments on three publicly available WSI datasets (TCGA-NSCLC, TCGA-RCC, and BRIGHT) demonstrate the superiority of our approach over current state-of-the-art MIL methods, achieving an improvement of 1.0% to 2.6% in accuracy and 0.7% to 1.6% in AUROC.
https://arxiv.org/abs/2403.18134
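An illustrative sketch of a GTI-style block under our own wiring assumptions: one graph-convolution step over the tile adjacency models local neighbor relations, followed by standard multi-head self-attention for global context, with residual connections and layer norms.

```python
import torch
import torch.nn as nn

class GTIBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.gcn_weight = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, adj):
        # x: (1, N, dim) tile features; adj: (N, N) row-normalized spatial adjacency.
        local = torch.relu(self.gcn_weight(adj @ x[0])).unsqueeze(0)  # GCN step: A X W
        x = self.norm1(x + local)
        global_out, _ = self.attn(x, x, x)                            # full self-attention
        return self.norm2(x + global_out)
```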
In this paper, we tackle the new and challenging problem of text-driven generation of 3D garments with high-quality textures. We propose "WordRobe", a novel framework for generating unposed, textured 3D garment meshes from user-friendly text prompts. We achieve this by first learning a latent representation of 3D garments using a novel coarse-to-fine training strategy and a loss for latent disentanglement, promoting better latent interpolation. Subsequently, we align the garment latent space with the CLIP embedding space in a weakly supervised manner, enabling text-driven 3D garment generation and editing. For appearance modeling, we leverage the zero-shot generation capability of ControlNet to synthesize view-consistent texture maps in a single feed-forward inference step, drastically decreasing generation time compared to existing methods. We demonstrate superior performance over current SOTAs for learning the 3D garment latent space, garment interpolation, and text-driven texture synthesis, supported by quantitative evaluation and a qualitative user study. The unposed 3D garment meshes generated using WordRobe can be directly fed to standard cloth simulation and animation pipelines without any post-processing.
https://arxiv.org/abs/2403.17541