Cervical cancer is a leading cause of mortality among women, underscoring the need for regular screening to enable early diagnosis and preemptive treatment of pre-cancerous conditions. The transformation zone of the cervix, where cellular differentiation occurs, plays a critical role in the detection of abnormalities. Colposcopy has emerged as a pivotal tool in cervical cancer prevention, as it provides a meticulous examination of cervical abnormalities. However, challenges in visual evaluation necessitate the development of Computer-Aided Diagnosis (CAD) systems. We propose a novel CAD system that combines the strengths of several deep-learning descriptors (ResNet50, ResNet101, and ResNet152) with appropriate feature normalization (min-max) and a feature-reduction technique (LDA). Combining different descriptors ensures that all features are captured, from low-level (edges, colour) to high-level (shape, texture); feature normalization prevents biased learning; and feature reduction avoids overfitting. We conduct experiments on the IARC dataset provided by the WHO. The dataset is first segmented and balanced. Our approach achieves exceptional performance, in the range of 97%-100%, for both normal-abnormal and type classification. A competing approach for type classification on the same dataset achieved 81%-91%.
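As a rough illustration of the pipeline described above, the sketch below concatenates three descriptor blocks (random stand-ins for the ResNet50/101/152 feature vectors), applies min-max normalization, and projects along a two-class Fisher (LDA) direction. The data, backbones, and classifier are toy assumptions, not the paper's implementation.

```python
import numpy as np

def min_max_normalize(features, eps=1e-12):
    """Scale each feature dimension to [0, 1] (min-max normalization)."""
    lo, hi = features.min(axis=0), features.max(axis=0)
    return (features - lo) / (hi - lo + eps)

def fisher_lda_direction(X, y):
    """Two-class Fisher discriminant direction: w = Sw^{-1} (mu1 - mu0)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), mu1 - mu0)
    return w / np.linalg.norm(w)

# Fused descriptor: concatenation of per-backbone feature vectors
# (random stand-ins here for the three ResNet outputs).
rng = np.random.default_rng(0)
f50, f101, f152 = (rng.normal(size=(40, 8)) for _ in range(3))
fused = np.concatenate([f50, f101, f152], axis=1)
labels = np.array([0] * 20 + [1] * 20)
fused[labels == 1] += 2.0  # make the toy classes separable

normed = min_max_normalize(fused)
w = fisher_lda_direction(normed, labels)
projected = normed @ w
```

Any standard classifier could then operate on `projected`; the key point is that normalization precedes the LDA reduction.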
https://arxiv.org/abs/2405.01600
Data augmentation serves as a popular regularization technique to combat overfitting challenges in neural networks. While automatic augmentation has demonstrated success in image classification tasks, its application to time-series problems, particularly in long-term forecasting, has received comparatively less attention. To address this gap, we introduce a time-series automatic augmentation approach named TSAA, which is both efficient and easy to implement. The solution involves tackling the associated bilevel optimization problem through a two-step process: initially training a non-augmented model for a limited number of epochs, followed by an iterative split procedure. During this iterative process, we alternate between identifying a robust augmentation policy through Bayesian optimization and refining the model while discarding suboptimal runs. Extensive evaluations on challenging univariate and multivariate forecasting benchmark problems demonstrate that TSAA consistently outperforms several robust baselines, suggesting its potential integration into prediction pipelines.
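The two-step procedure above can be sketched as follows. Everything here is a toy stand-in: `train` is a synthetic loss surface, the "policy" is a single augmentation strength, and random search replaces Bayesian optimization purely for illustration.

```python
import random

def train(policy_strength, epochs):
    """Toy stand-in for fitting a forecaster: returns a validation loss.
    (Pretends the best augmentation strength is 0.3.)"""
    return (policy_strength - 0.3) ** 2 + 1.0 / (1 + epochs)

def tsaa_sketch(warmup_epochs=5, n_splits=3, candidates_per_split=8, seed=0):
    rng = random.Random(seed)
    # Step 1: train a non-augmented model for a limited number of epochs.
    best_policy, best_loss = 0.0, train(0.0, warmup_epochs)
    # Step 2: iterative splits, alternating policy search and refinement.
    for _ in range(n_splits):
        trials = [rng.uniform(0.0, 1.0) for _ in range(candidates_per_split)]
        for p in trials:                 # stand-in for Bayesian-optimization proposals
            loss = train(p, warmup_epochs)
            if loss < best_loss:         # keep the best run,
                best_policy, best_loss = p, loss  # discard suboptimal ones
    return best_policy, best_loss

policy, loss = tsaa_sketch()
```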
https://arxiv.org/abs/2405.00319
In patent prosecution, image-based retrieval systems for identifying similarities between current patent images and prior art are pivotal to ensuring the novelty and non-obviousness of patent applications. Despite their growing popularity in recent years, existing attempts, while effective at recognizing images within the same patent, fail to deliver practical value due to their limited generalizability in retrieving relevant prior art. Moreover, this task inherently involves the challenges posed by the abstract visual features of patent images, the skewed distribution of image classifications, and the semantic information of image descriptions. Therefore, we propose a language-informed, distribution-aware multimodal approach to patent image feature learning, which enriches the semantic understanding of patent images by integrating Large Language Models and improves the performance of underrepresented classes with our proposed distribution-aware contrastive losses. Extensive experiments on the DeepPatent2 dataset show that our proposed method achieves state-of-the-art or comparable performance in image-based patent retrieval with mAP +53.3%, Recall@10 +41.8%, and MRR@10 +51.9%. Furthermore, through an in-depth user analysis, we explore our model's ability to aid patent professionals in their image retrieval efforts, highlighting the model's real-world applicability and effectiveness.
https://arxiv.org/abs/2404.19360
In the image classification task, deep neural networks frequently rely on bias attributes that are spuriously correlated with a target class in the presence of dataset bias, resulting in degraded performance when applied to data without bias attributes. The task of debiasing aims to compel classifiers to learn intrinsic attributes that inherently define a target class rather than focusing on bias attributes. While recent approaches mainly focus on emphasizing the learning of data samples without bias attributes (i.e., bias-conflicting samples) compared to samples with bias attributes (i.e., bias-aligned samples), they fall short of directly guiding models where to focus for learning intrinsic features. To address this limitation, this paper proposes a method that provides the model with explicit spatial guidance that indicates the region of intrinsic features. We first identify the intrinsic features by investigating the class-discerning common features between a bias-aligned (BA) sample and a bias-conflicting (BC) sample (i.e., bias-contrastive pair). Next, we enhance the intrinsic features in the BA sample that are relatively under-exploited for prediction compared to the BC sample. To construct the bias-contrastive pair without using bias information, we introduce a bias-negative score that distinguishes BC samples from BA samples employing a biased model. The experiments demonstrate that our method achieves state-of-the-art performance on synthetic and real-world datasets with various levels of bias severity.
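One way to read the bias-negative score above: a sample the *biased* model gets confidently wrong is likely bias-conflicting (BC), while one it fits easily is likely bias-aligned (BA). The sketch below uses the biased model's per-sample cross-entropy as an illustrative stand-in for such a score; the exact formulation in the paper may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def bias_negative_score(biased_logits, labels):
    """Cross-entropy of the biased model on the true label: high for
    bias-conflicting (BC) samples, low for bias-aligned (BA) samples."""
    p = softmax(biased_logits)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12)

# BA sample: biased model confident and correct; BC sample: confident and wrong.
logits = np.array([[4.0, 0.0],    # BA, true label 0
                   [4.0, 0.0]])   # BC, true label 1
scores = bias_negative_score(logits, np.array([0, 1]))
```

Pairing a high-score (BC) sample with a low-score (BA) sample of the same class then forms a bias-contrastive pair without any bias annotations.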
https://arxiv.org/abs/2404.19250
Recently, deep learning models have achieved excellent performance in hyperspectral image (HSI) classification. Among the many deep models, the Transformer has gradually attracted interest for its excellence in modeling the long-range dependencies of spatial-spectral features in HSI. However, the Transformer suffers from quadratic computational complexity due to its self-attention mechanism, making it heavier than other models and thus limiting its adoption in HSI processing. Fortunately, the recently emerging state-space-model-based Mamba shows great computational efficiency while matching the modeling power of Transformers. Therefore, in this paper, we make a preliminary attempt to apply Mamba to HSI classification, leading to the proposed spectral-spatial Mamba (SS-Mamba). Specifically, SS-Mamba mainly consists of a spectral-spatial token generation module and several stacked spectral-spatial Mamba blocks. First, the token generation module converts any given HSI cube into sequences of spatial and spectral tokens. These tokens are then sent to the stacked spectral-spatial Mamba blocks (SS-MB). Each SS-MB block consists of two basic Mamba blocks and a spectral-spatial feature enhancement module. The spatial and spectral tokens are processed by the two basic Mamba blocks, respectively. In addition, the feature enhancement module modulates the spatial and spectral tokens using the HSI sample's center-region information. In this way, the spectral and spatial tokens cooperate with each other, achieving information fusion within each block. Experimental results on widely used HSI datasets reveal that the proposed model achieves competitive results compared with state-of-the-art methods. The Mamba-based method opens a new window for HSI classification.
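The token generation step described above can be pictured as two views of the same cube: one token per pixel (carrying its full spectrum) and one token per band (carrying the flattened band image). A minimal reshaping sketch, not the paper's module:

```python
import numpy as np

def spectral_spatial_tokens(hsi_cube):
    """Convert an HSI cube (H, W, B) into two token sequences:
    - spatial tokens: one per pixel, each holding that pixel's spectrum
    - spectral tokens: one per band, each holding the flattened band image
    """
    H, W, B = hsi_cube.shape
    spatial = hsi_cube.reshape(H * W, B)     # (H*W, B)
    spectral = spatial.T.copy()              # (B, H*W)
    return spatial, spectral

cube = np.arange(3 * 3 * 4, dtype=float).reshape(3, 3, 4)
spa, spe = spectral_spatial_tokens(cube)
```

In SS-Mamba these two sequences are processed by separate Mamba blocks and then fused; here we only show the tokenization.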
https://arxiv.org/abs/2404.18401
Land cover analysis using hyperspectral images (HSI) remains an open problem due to their low spatial resolution and complex spectral information. Recent studies are primarily dedicated to designing Transformer-based architectures for modeling spatial-spectral long-range dependencies, which is computationally expensive with quadratic complexity. The selective structured state space model (Mamba), which efficiently models long-range dependencies with linear complexity, has recently shown promising progress. However, its potential in hyperspectral image processing, which requires handling numerous spectral bands, has not yet been explored. In this paper, we innovatively propose S$^2$Mamba, a spatial-spectral state space model for hyperspectral image classification, to excavate spatial-spectral contextual features, resulting in more efficient and accurate land cover analysis. In S$^2$Mamba, two selective structured state space models operating along different dimensions are designed for feature extraction, one spatial and the other spectral, along with a spatial-spectral mixture gate for optimal fusion. More specifically, S$^2$Mamba first captures spatial contextual relations by letting each pixel interact with its adjacent pixels through a Patch Cross Scanning module, and then explores semantic information from continuous spectral bands through a Bi-directional Spectral Scanning module. Considering the distinct strengths of the two attributes in homogeneous and complicated texture scenes, we realize the Spatial-spectral Mixture Gate with a group of learnable matrices, allowing for the adaptive incorporation of representations learned across different dimensions. Extensive experiments conducted on HSI classification benchmarks demonstrate the superiority and promise of S$^2$Mamba. The code will be available at: this https URL.
https://arxiv.org/abs/2404.18213
Discriminative deep learning models with a linear+softmax final layer have a problem: the latent space only predicts the conditional probabilities $p(Y|X)$ but not the full joint distribution $p(Y,X)$, which necessitates a generative approach. The conditional probability cannot detect outliers, causing outlier sensitivity in softmax networks. This exacerbates model over-confidence, impacting many problems such as hallucinations, confounding biases, and dependence on large datasets. To address this, we introduce a novel embedding constraint based on the Method of Moments (MoM). We investigate the use of polynomial moments ranging from first-order through fourth-order hyper-covariance matrices. Furthermore, we use this embedding constraint to train an Axis-Aligned Gaussian Mixture Model (AAGMM) final layer, which learns not only the conditional but also the joint distribution of the latent space. We apply this method to the domain of semi-supervised image classification by extending FlexMatch with our technique. We find that our MoM constraint with the AAGMM layer is able to match the reported FlexMatch accuracy while also modeling the joint distribution, thereby reducing outlier sensitivity. We also present a preliminary outlier detection strategy based on Mahalanobis distance and discuss future improvements to this strategy. Code is available at: \url{this https URL}
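Two of the ingredients above are easy to sketch in isolation: a moment-matching penalty on the latent embedding (shown here only up to second order; the paper goes to fourth-order hyper-covariances) and the Mahalanobis distance to axis-aligned Gaussian components used for outlier scoring. Targets and data are toy assumptions.

```python
import numpy as np

def moment_penalty(latent, target_mean=0.0, target_var=1.0):
    """First/second-order Method-of-Moments embedding penalty: drive the
    latent batch statistics toward fixed target moments."""
    m1 = latent.mean(axis=0)
    m2 = latent.var(axis=0)
    return np.mean((m1 - target_mean) ** 2) + np.mean((m2 - target_var) ** 2)

def aagmm_mahalanobis(x, means, diag_vars):
    """Squared Mahalanobis distance from x to each axis-aligned (diagonal
    covariance) Gaussian component."""
    return ((x[None, :] - means) ** 2 / diag_vars).sum(axis=1)

rng = np.random.default_rng(1)
z = rng.standard_normal((256, 8))           # latent batch already near target
pen = moment_penalty(z)

means = np.zeros((3, 8)); means[1] += 5.0; means[2] -= 5.0
dists = aagmm_mahalanobis(np.zeros(8), means, np.ones((3, 8)))
```

A sample whose minimum component distance is large would be flagged as an outlier under the preliminary strategy the paper mentions.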
https://arxiv.org/abs/2404.17978
CLIP showcases exceptional cross-modal matching capabilities due to its training on image-text contrastive learning tasks. However, without specific optimization for unimodal scenarios, its performance in single-modality feature extraction might be suboptimal. Despite this, some studies have directly used CLIP's image encoder for tasks like few-shot classification, introducing a misalignment between its pre-training objectives and feature extraction methods. This inconsistency can diminish the quality of the image's feature representation, adversely affecting CLIP's effectiveness in target tasks. In this paper, we view text features as precise neighbors of image features in CLIP's space and present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighboring texts. This feature extraction method aligns better with CLIP's pre-training objectives, thereby fully leveraging CLIP's robust cross-modal capabilities. The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images. We introduce the Auto Text Generator (ATG) to automatically generate the required texts in a data-free and training-free manner. We apply CODER to CLIP's zero-shot and few-shot image classification tasks. Experimental results across various datasets and models confirm CODER's effectiveness. Code is available at: this https URL.
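The core idea, representing an image by its distance structure to a bank of neighbor texts, can be sketched with cosine similarities. The two-dimensional embeddings and the "cat/dog" text bank below are hypothetical stand-ins for CLIP features and ATG-generated texts.

```python
import numpy as np

def coder_representation(image_emb, text_embs):
    """Represent an image by its cosine similarities to a bank of neighbor
    texts (a minimal sketch of the cross-modal neighbor representation)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return txt @ img

# Hypothetical text bank: two cat-like texts and one dog-like text.
texts = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
image = np.array([0.9, 0.1])   # embedding closer to the cat-like texts
rep = coder_representation(image, texts)
```

Downstream classification then compares these neighbor-similarity vectors rather than raw image embeddings, which stays closer to CLIP's matching-style pre-training objective.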
https://arxiv.org/abs/2404.17753
Digital pathology and the integration of artificial intelligence (AI) models have revolutionized histopathology, opening new opportunities. With the increasing availability of Whole Slide Images (WSIs), there is a growing demand for efficient retrieval, processing, and analysis of relevant images from vast biomedical archives. However, processing WSIs presents challenges due to their large size and content complexity. Fully digesting a WSI computationally is impractical, and processing all patches individually is prohibitively expensive. In this paper, we propose an unsupervised patching algorithm, Sequential Patching Lattice for Image Classification and Enquiry (SPLICE). This novel approach condenses a histopathology WSI into a compact set of representative patches, forming a "collage" of the WSI while minimizing redundancy. SPLICE prioritizes patch quality and uniqueness by sequentially analyzing a WSI and selecting non-redundant representative features. We evaluated SPLICE for search and match applications, demonstrating improved accuracy and reduced computation time and storage requirements compared to existing state-of-the-art methods. As an unsupervised method, SPLICE effectively reduces the storage required to represent tissue images by 50%. This reduction enables numerous algorithms in computational pathology to operate much more efficiently, paving the way for accelerated adoption of digital pathology.
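The sequential, redundancy-minimizing selection above can be sketched as a one-pass scan that keeps a patch only when it is sufficiently far from every representative kept so far. The feature vectors and threshold below are toy assumptions; the paper's patch features and criteria are richer.

```python
import numpy as np

def splice_select(patch_features, threshold):
    """Sequentially scan patch features; keep a patch only if it is farther
    than `threshold` from every representative kept so far (a minimal sketch
    of SPLICE's collage construction)."""
    kept = []
    for f in patch_features:
        if all(np.linalg.norm(f - k) > threshold for k in kept):
            kept.append(f)
    return np.array(kept)

# Three near-duplicate tissue patches and two near-duplicates of a distinct one.
patches = np.array([[0.0, 0.0], [0.05, 0.0], [0.0, 0.04],
                    [5.0, 5.0], [5.02, 5.0]])
collage = splice_select(patches, threshold=1.0)
```

Five patches collapse to two representatives, which is exactly the storage reduction the method trades on.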
https://arxiv.org/abs/2404.17704
AI systems rely on extensive training on large datasets to address various tasks. However, image-based systems, particularly those used for demographic attribute prediction, face significant challenges. Many current face image datasets primarily focus on demographic factors such as age, gender, and skin tone, overlooking other crucial facial attributes like hairstyle and accessories. This narrow focus limits the diversity of the data and consequently the robustness of AI systems trained on them. This work aims to address this limitation by proposing a methodology for generating synthetic face image datasets that capture a broader spectrum of facial diversity. Specifically, our approach integrates a systematic prompt formulation strategy, encompassing not only demographics and biometrics but also non-permanent traits like make-up, hairstyle, and accessories. These prompts guide a state-of-the-art text-to-image model in generating a comprehensive dataset of high-quality realistic images and can be used as an evaluation set in face analysis systems. Compared to existing datasets, our proposed dataset proves equally or more challenging in image classification tasks while being much smaller in size.
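A systematic prompt formulation strategy of the kind described can be sketched as a Cartesian product over attribute pools. The pools and template below are hypothetical; the paper's actual prompt vocabulary covering demographics, biometrics, and non-permanent traits is far richer.

```python
import itertools

# Hypothetical attribute pools (illustrative, not the paper's lists).
DEMOGRAPHICS = ["a young woman", "an elderly man"]
HAIRSTYLES = ["with curly hair", "with a buzz cut"]
ACCESSORIES = ["wearing glasses", "wearing a headscarf"]

def formulate_prompts():
    """Systematically combine demographic and non-permanent traits into
    text-to-image prompts."""
    return [
        f"a portrait photo of {d}, {h}, {a}"
        for d, h, a in itertools.product(DEMOGRAPHICS, HAIRSTYLES, ACCESSORIES)
    ]

prompts = formulate_prompts()
```

Each prompt would then be fed to the text-to-image model, so coverage of the attribute space is controlled by construction rather than left to chance.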
https://arxiv.org/abs/2404.17255
Model Weight Averaging (MWA) is a technique that seeks to enhance a model's performance by averaging the weights of multiple trained models. This paper first empirically finds that 1) vanilla MWA can benefit class-imbalanced learning, and 2) performing model averaging in the early epochs of training yields a greater performance improvement than doing so in later epochs. Inspired by these two observations, we propose a novel MWA technique for class-imbalanced learning tasks named Iterative Model Weight Averaging (IMWA). Specifically, IMWA divides the entire training stage into multiple episodes. Within each episode, multiple models are concurrently trained from the same initialized model weights and subsequently averaged into a single model. The weights of this averaged model then serve as a fresh initialization for the ensuing episode, thus establishing an iterative learning paradigm. Compared to vanilla MWA, IMWA achieves higher performance improvements at the same computational cost. Moreover, IMWA can further enhance the performance of methods employing an EMA strategy, demonstrating that IMWA and EMA can complement each other. Extensive experiments on various class-imbalanced learning tasks, i.e., class-imbalanced image classification, semi-supervised class-imbalanced image classification, and semi-supervised object detection, showcase the effectiveness of our IMWA.
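The episode loop above has a simple shape: train several models from one initialization, average their weights, and let the average seed the next episode. The sketch below replaces real training with random weight jitter purely to show the control flow.

```python
import copy
import random

def train_one_model(weights, rng):
    """Toy stand-in for one training run: jitter each weight."""
    return {k: v + rng.gauss(0, 0.1) for k, v in weights.items()}

def average_weights(models):
    keys = models[0].keys()
    return {k: sum(m[k] for m in models) / len(models) for k in keys}

def imwa(init_weights, episodes=3, models_per_episode=4, seed=0):
    """Iterative Model Weight Averaging: per episode, train several models
    from the same initialization, average them, and reuse the average as the
    next episode's initialization."""
    rng = random.Random(seed)
    weights = copy.deepcopy(init_weights)
    for _ in range(episodes):
        models = [train_one_model(weights, rng) for _ in range(models_per_episode)]
        weights = average_weights(models)
    return weights

final = imwa({"w": 0.0, "b": 1.0})
```

In a real pipeline the per-episode models could be trained concurrently, which is where the "same computational cost as vanilla MWA" claim comes from.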
https://arxiv.org/abs/2404.16331
Pooling layers (e.g., max and average) may overlook important information encoded in the spatial arrangement of pixel intensity and/or feature values. We propose a novel lacunarity pooling layer that aims to capture the spatial heterogeneity of the feature maps by evaluating the variability within local windows. The layer operates at multiple scales, allowing the network to adaptively learn hierarchical features. The lacunarity pooling layer can be seamlessly integrated into any artificial neural network architecture. Experimental results demonstrate the layer's effectiveness in capturing intricate spatial patterns, leading to improved feature extraction capabilities. The proposed approach holds promise in various domains, especially in agricultural image analysis tasks. This work contributes to the evolving landscape of artificial neural network architectures by introducing a novel pooling layer that enriches the representation of spatial features. Our code is publicly available.
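One common lacunarity statistic is the ratio of the second moment to the squared first moment within a window, which is 1 for a homogeneous region and grows with spatial heterogeneity. The single-scale, fixed-window sketch below uses that statistic; the paper's layer is multi-scale and trained end-to-end, so treat this only as an illustration of what max/average pooling would miss.

```python
import numpy as np

def lacunarity_pool(feature_map, window=2, eps=1e-12):
    """Pool non-overlapping windows by E[x^2] / (E[x])^2, a lacunarity-style
    statistic that increases with within-window variability."""
    H, W = feature_map.shape
    out = np.empty((H // window, W // window))
    for i in range(0, H - window + 1, window):
        for j in range(0, W - window + 1, window):
            patch = feature_map[i:i + window, j:j + window]
            out[i // window, j // window] = (patch ** 2).mean() / (patch.mean() ** 2 + eps)
    return out

flat = np.ones((4, 4))                           # homogeneous region
bumpy = np.ones((4, 4)); bumpy[::2, ::2] = 3.0   # heterogeneous region
```

Note that average pooling would assign different values to `flat` and `bumpy` windows only via their means, whereas the lacunarity statistic responds to their arrangement-driven variability.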
https://arxiv.org/abs/2404.16268
The recent prevalence of publicly accessible, large medical imaging datasets has led to a proliferation of artificial intelligence (AI) models for cardiovascular image classification and analysis. At the same time, the potentially significant impacts of these models have motivated the development of a range of explainable AI (XAI) methods that aim to explain model predictions given certain image inputs. However, many of these methods are not developed or evaluated with domain experts, and explanations are not contextualized in terms of medical expertise or domain knowledge. In this paper, we propose a novel framework and python library, MiMICRI, that provides domain-centered counterfactual explanations of cardiovascular image classification models. MiMICRI helps users interactively select and replace segments of medical images that correspond to morphological structures. From the counterfactuals generated, users can then assess the influence of each segment on model predictions, and validate the model against known medical facts. We evaluate this library with two medical experts. Our evaluation demonstrates that a domain-centered XAI approach can enhance the interpretability of model explanations, and help experts reason about models in terms of relevant domain knowledge. However, concerns were also surfaced about the clinical plausibility of the counterfactuals generated. We conclude with a discussion on the generalizability and trustworthiness of the MiMICRI framework, as well as the implications of our findings on the development of domain-centered XAI methods for model interpretability in healthcare contexts.
https://arxiv.org/abs/2404.16174
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate this correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification but with less ($<$35\%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at this https URL.
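The inference-time ensembling above can be sketched as a softmax over similarities between a task-metadata embedding and the cluster centers, with the resulting weights mixing the experts' outputs. The embeddings, centers, and temperature below are toy assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mode_ensemble(expert_logits, task_embedding, cluster_centers, tau=0.1):
    """Weight each data expert's output by the softmax-normalized similarity
    between the task metadata embedding and its cluster center (a minimal
    sketch of MoDE's inference-time ensembling)."""
    sims = cluster_centers @ task_embedding
    weights = softmax(sims / tau)
    return weights, (weights[:, None] * expert_logits).sum(axis=0)

centers = np.array([[1.0, 0.0], [0.0, 1.0]])   # two data clusters
task = np.array([0.95, 0.05])                  # task metadata near cluster 0
logits = np.array([[2.0, -2.0], [-2.0, 2.0]])  # one expert per cluster
w, fused = mode_ensemble(logits, task, centers)
```

Because weighting happens only at inference, experts can be trained asynchronously and new ones added without retraining the rest, matching the flexibility claim above.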
https://arxiv.org/abs/2404.16030
Unsupervised domain adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. The most recent UDA methods often resort to adversarial training to yield state-of-the-art results, and the majority of existing UDA methods employ convolutional neural networks (CNNs) as feature extractors to learn domain-invariant features. The Vision Transformer (ViT) has attracted tremendous attention since its emergence and has been widely used in various computer vision tasks, such as image classification, object detection, and semantic segmentation, yet its potential in adversarial domain adaptation has never been investigated. In this paper, we fill this gap by employing the ViT as the feature extractor in adversarial domain adaptation. Moreover, we empirically demonstrate that the ViT can be a plug-and-play component in adversarial domain adaptation: directly replacing the CNN-based feature extractor in existing UDA methods with a ViT-based feature extractor easily yields performance improvements. The code is available at this https URL.
https://arxiv.org/abs/2404.15817
The integration of deep learning based systems in clinical practice is often impeded by challenges rooted in limited and heterogeneous medical datasets. In addition, prioritization of marginal performance improvements on a few, narrowly scoped benchmarks over clinical applicability has slowed down meaningful algorithmic progress. This trend often results in excessive fine-tuning of existing methods to achieve state-of-the-art performance on selected datasets rather than fostering clinically relevant innovations. In response, this work presents a comprehensive benchmark for the MedMNIST+ database to diversify the evaluation landscape and conduct a thorough analysis of common convolutional neural networks (CNNs) and Transformer-based architectures, for medical image classification. Our evaluation encompasses various medical datasets, training methodologies, and input resolutions, aiming to reassess the strengths and limitations of widely used model variants. Our findings suggest that computationally efficient training schemes and modern foundation models hold promise in bridging the gap between expensive end-to-end training and more resource-refined approaches. Additionally, contrary to prevailing assumptions, we observe that higher resolutions may not consistently improve performance beyond a certain threshold, advocating for the use of lower resolutions, particularly in prototyping stages, to expedite processing. Notably, our analysis reaffirms the competitiveness of convolutional models compared to ViT-based architectures emphasizing the importance of comprehending the intrinsic capabilities of different model architectures. Moreover, we hope that our standardized evaluation framework will help enhance transparency, reproducibility, and comparability on the MedMNIST+ dataset collection as well as future research within the field. Code will be released soon.
https://arxiv.org/abs/2404.15786
Single-model systems often suffer from deficiencies in tasks such as speaker verification (SV) and image classification, relying heavily on partial prior knowledge during decision-making and resulting in suboptimal performance. Although multi-model fusion (MMF) can mitigate some of these issues, redundancy in the learned representations may limit improvements. To this end, we propose an adversarial complementary representation learning (ACoRL) framework that enables newly trained models to avoid previously acquired knowledge, allowing each individual component model to learn maximally distinct, complementary representations. We provide three detailed explanations of why this works, and experimental results demonstrate that our method improves performance more efficiently than traditional MMF. Furthermore, attribution analysis validates that a model trained under ACoRL acquires more complementary knowledge, highlighting the efficacy of our approach in enhancing efficiency and robustness across tasks.
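The "avoid previously acquired knowledge" objective can be pictured as a penalty on alignment between the new model's representation and each frozen, previously trained model's representation. The squared-cosine penalty below is an illustrative stand-in for ACoRL's actual adversarial objective.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def complementarity_penalty(new_repr, frozen_reprs):
    """Penalize the new model for representations that align with any
    previously trained model's, pushing it toward maximally distinct,
    complementary features. Squared cosine, so negative alignment (which
    still carries the same information) is penalized too."""
    return float(np.mean([cosine(new_repr, f) ** 2 for f in frozen_reprs]))

frozen = [np.array([1.0, 0.0, 0.0])]        # representation of an earlier model
aligned = np.array([0.9, 0.1, 0.0])         # redundant candidate
orthogonal = np.array([0.0, 0.0, 1.0])      # complementary candidate
```

In training, this term would be added to the task loss, so each new component model trades a little task fit for representations the ensemble does not already have.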
https://arxiv.org/abs/2404.15704
In traditional statistical learning, data points are usually assumed to be independently and identically distributed (i.i.d.) following an unknown probability distribution. This paper presents a contrasting viewpoint, perceiving data points as interconnected and employing a Markov reward process (MRP) for data modeling. We reformulate the typical supervised learning as an on-policy policy evaluation problem within reinforcement learning (RL), introducing a generalized temporal difference (TD) learning algorithm as a resolution. Theoretically, our analysis draws connections between the solutions of linear TD learning and ordinary least squares (OLS). We also show that under specific conditions, particularly when noises are correlated, the TD's solution proves to be a more effective estimator than OLS. Furthermore, we establish the convergence of our generalized TD algorithms under linear function approximation. Empirical studies verify our theoretical results, examine the vital design of our TD algorithm and show practical utility across various datasets, encompassing tasks such as regression and image classification with deep learning.
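A minimal way to see the supervised-learning-as-policy-evaluation reformulation: run semi-gradient linear TD(0) on labeled pairs treated as one-step transitions of an MRP with the label as reward. With $\gamma = 0$ (every sample terminal) the TD fixed point coincides with the least-squares solution, which is the TD-OLS connection the paper analyzes; the toy data below is an assumption for illustration.

```python
import numpy as np

def linear_td0(features, rewards, next_features, gamma=0.9, lr=0.05, epochs=200):
    """Semi-gradient linear TD(0): w += lr * (r + gamma * w.x' - w.x) * x."""
    w = np.zeros(features.shape[1])
    for _ in range(epochs):
        for x, r, xn in zip(features, rewards, next_features):
            td_error = r + gamma * (w @ xn) - (w @ x)
            w += lr * td_error * x
    return w

# Regression viewed as an MRP with gamma = 0: the reward is the label,
# and every transition is terminal.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([2.0, 3.0, 5.0])
w = linear_td0(X, y, next_features=np.zeros_like(X), gamma=0.0)
```

Here the exact least-squares solution `[2, 3]` fits all three points, and the TD iteration converges to it; nonzero `gamma` would instead propagate value between linked data points.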
https://arxiv.org/abs/2404.15518
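The claimed connection between linear TD and OLS can be made concrete: the fixed point of linear TD(0) solves X^T (X - γ X') w = X^T r, which collapses to the OLS normal equations when γ = 0. Below is a sketch under assumed notation (features X, successor features X', labels treated as rewards r); the wrap-around "chain" over data points is an invented toy, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)

def td_fixed_point(X, X_next, r, gamma):
    """Fixed point of linear TD(0): solve X^T (X - gamma * X_next) w = X^T r.
    Consecutive data points are treated as transitions of a Markov reward
    process, with labels playing the role of rewards."""
    A = X.T @ (X - gamma * X_next)
    b = X.T @ r
    return np.linalg.solve(A, b)

# Toy "dataset as a chain": point i transitions to point i + 1 (wrap-around).
X = rng.normal(size=(50, 3))
X_next = np.vstack([X[1:], X[:1]])
r = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=50)

w_ols = np.linalg.lstsq(X, r, rcond=None)[0]
w_td0 = td_fixed_point(X, X_next, r, gamma=0.0)  # gamma = 0 recovers OLS
print(np.allclose(w_td0, w_ols))
```

With γ > 0 the two estimators diverge, and the paper's claim is that the TD estimator can dominate OLS precisely when the noise carries the correlation structure the MRP view models.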
Capsule networks are a type of neural network that identifies image parts and hierarchically forms the instantiation parameters of a whole. The goal behind the network is to perform an inverse computer graphics task, and the network parameters are the mapping weights that transform parts into a whole. Training capsule networks on complex data with high intra-class or intra-part variation is challenging. This paper presents a multi-prototype architecture for guiding capsule networks to represent the variations in the image parts. To this end, instead of assigning a single capsule to each class and part, the proposed method employs several capsules (co-group capsules), capturing multiple prototypes of an object. In the final layer, co-group capsules compete, and their soft output is taken as the target for a competitive cross-entropy loss. Moreover, in the middle layers, the most active capsules map to the next layer with weights shared among the co-groups. Consequently, the resulting reduction in parameters through implicit weight sharing makes deeper capsule network layers feasible. Experimental results on the MNIST, SVHN, C-Cube, CEDAR, MCYT, and UTSig datasets reveal that the proposed model outperforms others in image classification accuracy.
https://arxiv.org/abs/2404.15445
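The competition among co-group capsules can be sketched in a few lines. In this simplified stand-in (an assumption of this sketch, not the paper's exact soft formulation), the most active of the K prototypes represents its class, and the winning activations feed an ordinary cross-entropy loss.

```python
import numpy as np

def competitive_logits(capsule_scores):
    """capsule_scores: (batch, num_classes, K) activations of the K co-group
    capsules per class. Within each class the prototypes compete and the most
    active one stands in for the class (a hard-max simplification)."""
    return capsule_scores.max(axis=2)

def cross_entropy(logits, labels):
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

scores = np.zeros((2, 3, 4))   # 2 samples, 3 classes, 4 prototypes per class
scores[0, 1, 2] = 5.0          # sample 0: prototype 2 of class 1 fires
scores[1, 0, 0] = 5.0          # sample 1: prototype 0 of class 0 fires
loss = cross_entropy(competitive_logits(scores), np.array([1, 0]))
print(loss)
```

Because only the winning prototype's gradient flows back, each co-group capsule specializes on the subset of samples it wins, which is how the architecture absorbs high intra-class variation.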
Multimodal medical imaging plays a pivotal role in clinical diagnosis and research, as it combines information from various imaging modalities to provide a more comprehensive understanding of the underlying pathology. Recently, deep learning-based multimodal fusion techniques have emerged as powerful tools for improving medical image classification. This review offers a thorough analysis of the developments in deep learning-based multimodal fusion for medical classification tasks. We explore the complementary relationships among prevalent clinical modalities and outline three main fusion schemes for multimodal classification networks: input fusion, intermediate fusion (encompassing single-level fusion, hierarchical fusion, and attention-based fusion), and output fusion. By evaluating the performance of these fusion techniques, we provide insight into the suitability of different network architectures for various multimodal fusion scenarios and application domains. Furthermore, we delve into challenges related to network architecture selection, handling incomplete multimodal data, and the potential limitations of multimodal fusion. Finally, we spotlight the promising future of Transformer-based multimodal fusion techniques and offer recommendations for future research in this rapidly evolving field.
https://arxiv.org/abs/2404.15022
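The three fusion schemes named in the review differ only in *where* modalities are combined along the pipeline. A minimal sketch, assuming two hypothetical modalities ("mri", "pet") and a toy linear-ReLU encoder that stands in for a real per-modality network:

```python
import numpy as np

rng = np.random.default_rng(2)

def encoder(x, w):
    """Stand-in per-modality encoder: a linear map with ReLU (hypothetical)."""
    return np.maximum(x @ w, 0.0)

def input_fusion(mri, pet):
    """Fuse raw modalities before any encoding (early fusion)."""
    return np.concatenate([mri, pet], axis=1)

def intermediate_fusion(mri, pet, w_mri, w_pet):
    """Encode each modality separately, then fuse the feature vectors."""
    return np.concatenate([encoder(mri, w_mri), encoder(pet, w_pet)], axis=1)

def output_fusion(p_mri, p_pet):
    """Average per-modality class probabilities (late / decision fusion)."""
    return (p_mri + p_pet) / 2.0

mri, pet = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
w_mri, w_pet = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
print(input_fusion(mri, pet).shape)                       # (4, 32)
print(intermediate_fusion(mri, pet, w_mri, w_pet).shape)  # (4, 16)
```

Input fusion forces one network to learn cross-modal structure from raw data; intermediate fusion (the family containing single-level, hierarchical, and attention-based variants) combines learned features; output fusion only merges decisions, which is the most robust to a missing modality but discards cross-modal feature interactions.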