Model Weight Averaging (MWA) is a technique that seeks to enhance a model's performance by averaging the weights of multiple trained models. This paper first empirically finds that 1) vanilla MWA can benefit class-imbalanced learning, and 2) performing model averaging in the early epochs of training yields a greater performance improvement than doing so in later epochs. Inspired by these two observations, in this paper we propose a novel MWA technique for class-imbalanced learning tasks named Iterative Model Weight Averaging (IMWA). Specifically, IMWA divides the entire training stage into multiple episodes. Within each episode, multiple models are concurrently trained from the same initialized model weights and subsequently averaged into a single model. The weights of this averaged model then serve as a fresh initialization for the ensuing episode, thus establishing an iterative learning paradigm. Compared to vanilla MWA, IMWA achieves higher performance improvements at the same computational cost. Moreover, IMWA can further enhance the performance of methods employing an EMA strategy, demonstrating that IMWA and EMA can complement each other. Extensive experiments on various class-imbalanced learning tasks, i.e., class-imbalanced image classification, semi-supervised class-imbalanced image classification, and semi-supervised object detection, showcase the effectiveness of our IMWA.
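A minimal sketch of the iterative averaging loop described above, assuming a PyTorch setup; `make_model` and `train_one_episode` are user-supplied placeholders, and the episode/model counts are illustrative rather than the paper's settings.

```python
import copy
import torch

def average_state_dicts(state_dicts):
    """Element-wise average of several model state_dicts (one vanilla MWA step)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        avg[key] = stacked.mean(dim=0).to(state_dicts[0][key].dtype)
    return avg

def imwa(make_model, train_one_episode, num_episodes=4, num_models=2):
    """Iterative Model Weight Averaging: in every episode, several models start from the
    same weights, are trained (concurrently in practice, e.g. with different data orders
    and augmentations), and their averaged weights seed the next episode."""
    init_weights = copy.deepcopy(make_model().state_dict())
    for _ in range(num_episodes):
        trained = []
        for _ in range(num_models):
            model = make_model()
            model.load_state_dict(copy.deepcopy(init_weights))
            train_one_episode(model)            # user-supplied training loop for one episode
            trained.append(model.state_dict())
        init_weights = average_state_dicts(trained)
    final_model = make_model()
    final_model.load_state_dict(init_weights)
    return final_model
```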
https://arxiv.org/abs/2404.16331
Pooling layers (e.g., max and average) may overlook important information encoded in the spatial arrangement of pixel intensity and/or feature values. We propose a novel lacunarity pooling layer that aims to capture the spatial heterogeneity of the feature maps by evaluating the variability within local windows. The layer operates at multiple scales, allowing the network to adaptively learn hierarchical features. The lacunarity pooling layer can be seamlessly integrated into any artificial neural network architecture. Experimental results demonstrate the layer's effectiveness in capturing intricate spatial patterns, leading to improved feature extraction capabilities. The proposed approach holds promise in various domains, especially in agricultural image analysis tasks. This work contributes to the evolving landscape of artificial neural network architectures by introducing a novel pooling layer that enriches the representation of spatial features. Our code is publicly available.
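The abstract does not spell out the exact statistic, but a common gliding-box definition of lacunarity is $\Lambda = \sigma^2/\mu^2 + 1$ over a local window; the PyTorch sketch below builds a pooling layer on that assumption, as an illustration of the idea rather than the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LacunarityPool2d(nn.Module):
    """Pools each local window by its lacunarity (variance / mean^2 + 1), a measure of
    spatial heterogeneity, instead of its max or mean. Assumes non-negative inputs
    (e.g. post-ReLU feature maps)."""
    def __init__(self, kernel_size=2, stride=2, eps=1e-6):
        super().__init__()
        self.kernel_size, self.stride, self.eps = kernel_size, stride, eps

    def forward(self, x):
        # Local first and second moments computed via average pooling.
        mean = F.avg_pool2d(x, self.kernel_size, self.stride)
        mean_sq = F.avg_pool2d(x * x, self.kernel_size, self.stride)
        var = (mean_sq - mean * mean).clamp(min=0.0)
        return var / (mean * mean + self.eps) + 1.0
```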
https://arxiv.org/abs/2404.16268
The recent prevalence of publicly accessible, large medical imaging datasets has led to a proliferation of artificial intelligence (AI) models for cardiovascular image classification and analysis. At the same time, the potentially significant impacts of these models have motivated the development of a range of explainable AI (XAI) methods that aim to explain model predictions given certain image inputs. However, many of these methods are not developed or evaluated with domain experts, and explanations are not contextualized in terms of medical expertise or domain knowledge. In this paper, we propose a novel framework and python library, MiMICRI, that provides domain-centered counterfactual explanations of cardiovascular image classification models. MiMICRI helps users interactively select and replace segments of medical images that correspond to morphological structures. From the counterfactuals generated, users can then assess the influence of each segment on model predictions, and validate the model against known medical facts. We evaluate this library with two medical experts. Our evaluation demonstrates that a domain-centered XAI approach can enhance the interpretability of model explanations, and help experts reason about models in terms of relevant domain knowledge. However, concerns were also surfaced about the clinical plausibility of the counterfactuals generated. We conclude with a discussion on the generalizability and trustworthiness of the MiMICRI framework, as well as the implications of our findings on the development of domain-centered XAI methods for model interpretability in healthcare contexts.
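A simplified illustration of the segment-swap counterfactual idea: replace one segmented morphological structure with the corresponding region from a donor image and measure the prediction shift. The function name and array-based interface here are assumptions for illustration, not MiMICRI's actual API.

```python
import numpy as np

def segment_counterfactual(image, donor, segmentation, segment_id, predict):
    """Swap one morphological segment for the donor's and measure how the model's
    prediction shifts -- a simplified view of segment-level counterfactuals."""
    counterfactual = image.copy()
    mask = segmentation == segment_id          # boolean mask of the chosen structure
    counterfactual[mask] = donor[mask]         # donor assumed co-registered with the image
    shift = predict(counterfactual) - predict(image)
    return counterfactual, shift
```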
https://arxiv.org/abs/2404.16174
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification but with less ($<$35\%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at this https URL.
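A rough sketch of the inference-time ensembling step, assuming each expert has already produced zero-shot logits and that expert weights come from the similarity between class-name embeddings (the task metadata) and cluster centers; the softmax temperature and the mean-similarity weighting are illustrative assumptions rather than the paper's exact rule.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def mode_ensemble(per_expert_logits, class_embs, cluster_centers, tau=0.07):
    """Weight each data expert's zero-shot logits by how closely the task metadata
    (here, L2-normalized class-name embeddings) matches the expert's cluster center."""
    # per_expert_logits: (num_experts, num_classes); cluster_centers: (num_experts, d)
    affinity = np.array([(class_embs @ center).mean() for center in cluster_centers])
    weights = softmax(affinity / tau)                  # one scalar weight per expert
    return (weights[:, None] * per_expert_logits).sum(axis=0)
```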
https://arxiv.org/abs/2404.16030
Unsupervised domain adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. The most recent UDA methods resort to adversarial training to yield state-of-the-art results, and the dominant share of existing UDA methods employ convolutional neural networks (CNNs) as feature extractors to learn domain-invariant features. The Vision Transformer (ViT) has attracted tremendous attention since its emergence and has been widely used in various computer vision tasks, such as image classification, object detection, and semantic segmentation, yet its potential in adversarial domain adaptation has never been investigated. In this paper, we fill this gap by employing the ViT as the feature extractor in adversarial domain adaptation. Moreover, we empirically demonstrate that ViT can be a plug-and-play component in adversarial domain adaptation: directly replacing the CNN-based feature extractor in existing UDA methods with a ViT-based feature extractor readily yields performance improvements. The code is available at this https URL.
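The paper's point is that the backbone swap is plug-and-play; as one hedged illustration, the sketch below drops a timm ViT into a DANN-style adversarial setup with a gradient-reversal domain discriminator. The discriminator design, class names, and hyperparameters are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torch.autograd import Function
import timm

class GradReverse(Function):
    """Gradient reversal layer used in DANN-style adversarial adaptation."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class ViTDANN(nn.Module):
    """DANN with a ViT feature extractor swapped in for the usual CNN backbone."""
    def __init__(self, num_classes, lambd=1.0):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of classification logits.
        self.backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        feat_dim = self.backbone.num_features
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.domain_head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 2))
        self.lambd = lambd

    def forward(self, x):
        feats = self.backbone(x)
        class_logits = self.classifier(feats)
        domain_logits = self.domain_head(GradReverse.apply(feats, self.lambd))
        return class_logits, domain_logits
```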
https://arxiv.org/abs/2404.15817
The integration of deep learning based systems in clinical practice is often impeded by challenges rooted in limited and heterogeneous medical datasets. In addition, the prioritization of marginal performance improvements on a few, narrowly scoped benchmarks over clinical applicability has slowed down meaningful algorithmic progress. This trend often results in excessive fine-tuning of existing methods to achieve state-of-the-art performance on selected datasets rather than fostering clinically relevant innovations. In response, this work presents a comprehensive benchmark for the MedMNIST+ database to diversify the evaluation landscape and conduct a thorough analysis of common convolutional neural networks (CNNs) and Transformer-based architectures for medical image classification. Our evaluation encompasses various medical datasets, training methodologies, and input resolutions, aiming to reassess the strengths and limitations of widely used model variants. Our findings suggest that computationally efficient training schemes and modern foundation models hold promise in bridging the gap between expensive end-to-end training and more resource-refined approaches. Additionally, contrary to prevailing assumptions, we observe that higher resolutions may not consistently improve performance beyond a certain threshold, advocating for the use of lower resolutions, particularly in prototyping stages, to expedite processing. Notably, our analysis reaffirms the competitiveness of convolutional models compared to ViT-based architectures, emphasizing the importance of comprehending the intrinsic capabilities of different model architectures. Moreover, we hope that our standardized evaluation framework will help enhance transparency, reproducibility, and comparability on the MedMNIST+ dataset collection as well as future research within the field. Code will be released soon.
https://arxiv.org/abs/2404.15786
Single-model systems often suffer from deficiencies in tasks such as speaker verification (SV) and image classification, relying heavily on partial prior knowledge during decision-making and resulting in suboptimal performance. Although multi-model fusion (MMF) can mitigate some of these issues, redundancy in the learned representations may limit improvements. To this end, we propose an adversarial complementary representation learning (ACoRL) framework that enables newly trained models to avoid previously acquired knowledge, allowing each individual component model to learn maximally distinct, complementary representations. We provide three detailed explanations of why this works, and experimental results demonstrate that our method improves performance more efficiently than traditional MMF. Furthermore, attribution analysis validates that the model trained under ACoRL acquires more complementary knowledge, highlighting the efficacy of our approach in enhancing efficiency and robustness across tasks.
https://arxiv.org/abs/2404.15704
In traditional statistical learning, data points are usually assumed to be independently and identically distributed (i.i.d.) following an unknown probability distribution. This paper presents a contrasting viewpoint, perceiving data points as interconnected and employing a Markov reward process (MRP) for data modeling. We reformulate typical supervised learning as an on-policy policy evaluation problem within reinforcement learning (RL), introducing a generalized temporal difference (TD) learning algorithm as a resolution. Theoretically, our analysis draws connections between the solutions of linear TD learning and ordinary least squares (OLS). We also show that under specific conditions, particularly when the noise is correlated, the TD solution proves to be a more effective estimator than OLS. Furthermore, we establish the convergence of our generalized TD algorithms under linear function approximation. Empirical studies verify our theoretical results, examine the vital design choices of our TD algorithm, and show practical utility across various datasets, encompassing tasks such as regression and image classification with deep learning.
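A small sketch contrasting linear TD(0) with OLS under linear function approximation; how the paper actually maps labels to rewards and chains samples into an MRP is its own construction, so the transition triples below are placeholders for illustration only.

```python
import numpy as np

def linear_td0(X, rewards, X_next, gamma=0.9, alpha=0.01, epochs=100):
    """Linear TD(0) policy evaluation: w <- w + alpha * (r + gamma * w@x' - w@x) * x,
    treating each (x, r, x') triple as a transition of a Markov reward process."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, r, x_next in zip(X, rewards, X_next):
            td_error = r + gamma * (w @ x_next) - (w @ x)
            w += alpha * td_error * x
    return w

def ols(X, y):
    """Ordinary least squares, the i.i.d. baseline the analysis compares against."""
    return np.linalg.lstsq(X, y, rcond=None)[0]
```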
https://arxiv.org/abs/2404.15518
Capsule networks are a type of neural network that identifies image parts and hierarchically forms the instantiation parameters of a whole. The goal behind the network is to perform an inverse computer graphics task, and the network parameters are the mapping weights that transform parts into a whole. The trainability of capsule networks on complex data with high intra-class or intra-part variation is challenging. This paper presents a multi-prototype architecture for guiding capsule networks to represent the variations in the image parts. To this end, instead of considering a single capsule for each class and part, the proposed method employs several capsules (co-group capsules), capturing multiple prototypes of an object. In the final layer, co-group capsules compete, and their soft output is considered the target for a competitive cross-entropy loss. Moreover, in the middle layers, the most active capsules map to the next layer with weights shared among the co-groups. Consequently, due to the reduction in parameters, the implicit weight sharing makes it possible to build deeper capsule networks. The experimental results on the MNIST, SVHN, C-Cube, CEDAR, MCYT, and UTSig datasets reveal that the proposed model outperforms others in image classification accuracy.
https://arxiv.org/abs/2404.15445
Multimodal medical imaging plays a pivotal role in clinical diagnosis and research, as it combines information from various imaging modalities to provide a more comprehensive understanding of the underlying pathology. Recently, deep learning-based multimodal fusion techniques have emerged as powerful tools for improving medical image classification. This review offers a thorough analysis of the developments in deep learning-based multimodal fusion for medical classification tasks. We explore the complementary relationships among prevalent clinical modalities and outline three main fusion schemes for multimodal classification networks: input fusion, intermediate fusion (encompassing single-level fusion, hierarchical fusion, and attention-based fusion), and output fusion. By evaluating the performance of these fusion techniques, we provide insight into the suitability of different network architectures for various multimodal fusion scenarios and application domains. Furthermore, we delve into challenges related to network architecture selection, the handling of incomplete multimodal data, and the potential limitations of multimodal fusion. Finally, we spotlight the promising future of Transformer-based multimodal fusion techniques and give recommendations for future research in this rapidly evolving field.
https://arxiv.org/abs/2404.15022
Hyperspectral image classification is a challenging task due to the high dimensionality and complex nature of hyperspectral data. In recent years, deep learning techniques have emerged as powerful tools for addressing these challenges. This survey provides a comprehensive overview of the current trends and future prospects in hyperspectral image classification, focusing on the advancements from deep learning models to the emerging use of transformers. We review the key concepts, methodologies, and state-of-the-art approaches in deep learning for hyperspectral image classification. Additionally, we discuss the potential of transformer-based models in this field and highlight the advantages and challenges associated with these approaches. Comprehensive experimental results have been undertaken using three hyperspectral datasets to verify the efficacy of various conventional deep-learning models and Transformers. Finally, we outline future research directions and potential applications that can further enhance the accuracy and efficiency of hyperspectral image classification. The source code is available at this https URL.
https://arxiv.org/abs/2404.14955
The traditional Transformer model encounters challenges with variable-length input sequences, particularly in Hyperspectral Image Classification (HSIC), leading to efficiency and scalability concerns. To overcome this, we propose a pyramid-based hierarchical transformer (PyFormer). This innovative approach organizes input data hierarchically into segments, each representing distinct abstraction levels, thereby enhancing processing efficiency for lengthy sequences. At each level, a dedicated transformer module is applied, effectively capturing both local and global context. Spatial and spectral information flow within the hierarchy facilitates communication and abstraction propagation. Integration of outputs from different levels culminates in the final input representation. Experimental results underscore the superiority of the proposed method over traditional approaches. Additionally, the incorporation of disjoint samples augments robustness and reliability, thereby highlighting the potential of our approach in advancing HSIC. The source code is available at this https URL.
https://arxiv.org/abs/2404.14945
Disjoint sampling is critical for rigorous and unbiased evaluation of state-of-the-art (SOTA) models. When training, validation, and test sets overlap or share data, a bias is introduced that inflates performance metrics and prevents accurate assessment of a model's true ability to generalize to new examples. This paper presents an innovative disjoint sampling approach for training SOTA models on hyperspectral image classification (HSIC) tasks. By separating training, validation, and test data without overlap, the proposed method facilitates a fairer evaluation of how well a model can classify pixels it was not exposed to during training or validation. Experiments demonstrate that the approach significantly improves a model's generalization compared to alternatives whose test data overlaps with the training and validation data. By eliminating data leakage between sets, disjoint sampling provides reliable metrics for benchmarking progress in HSIC. Researchers can have confidence that reported performance truly reflects a model's capability for classifying new scenes, not just memorized pixels. This rigorous methodology is critical for advancing SOTA models and their real-world application to large-scale land mapping with hyperspectral sensors. The source code is available at this https URL.
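A minimal sketch of what a disjoint, per-class stratified pixel split might look like for an HSI ground-truth map; the split fractions and the convention that label 0 is background are assumptions, not necessarily the paper's protocol.

```python
import numpy as np

def disjoint_split(label_map, train_frac=0.1, val_frac=0.1, seed=0):
    """Split labeled hyperspectral pixels into disjoint train/val/test coordinate sets,
    stratified per class, with no pixel appearing in more than one set."""
    rng = np.random.default_rng(seed)
    splits = {"train": [], "val": [], "test": []}
    for cls in np.unique(label_map):
        if cls == 0:                      # 0 is commonly the unlabeled background class
            continue
        coords = np.argwhere(label_map == cls)
        rng.shuffle(coords)
        n_train = int(len(coords) * train_frac)
        n_val = int(len(coords) * val_frac)
        splits["train"].append(coords[:n_train])
        splits["val"].append(coords[n_train:n_train + n_val])
        splits["test"].append(coords[n_train + n_val:])
    return {name: np.concatenate(parts) for name, parts in splits.items()}
```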
https://arxiv.org/abs/2404.14944
Mounting evidence in explainability for artificial intelligence (XAI) research suggests that good explanations should be tailored to individual tasks and should relate to concepts relevant to the task. However, building task-specific explanations is time consuming and requires domain expertise which can be difficult to integrate into generic XAI methods. A promising approach towards designing useful task-specific explanations with domain experts is based on the compositionality of semantic concepts. Here, we present a novel approach, CoProNN, that enables domain experts to quickly create concept-based explanations for computer vision tasks intuitively via natural language. Leveraging recent progress in deep generative methods, we propose to generate visual concept-based prototypes via text-to-image methods. These prototypes are then used to explain predictions of computer vision models via a simple k-Nearest-Neighbors routine. The modular design of CoProNN is simple to implement; it is straightforward to adapt to novel tasks and allows the classification and text-to-image models to be replaced as more powerful models are released. The approach can be evaluated offline against the ground truth of predefined prototypes that can be easily communicated to domain experts, as they are based on visual concepts. We show that our strategy competes very well with other concept-based XAI approaches on coarse-grained image classification tasks and may even outperform those methods on more demanding fine-grained tasks. We demonstrate the effectiveness of our method for human-machine collaboration settings in qualitative and quantitative user studies. All code and experimental data can be found in our GitHub $\href{this https URL}{repository}$.
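A hedged sketch of the prototype-kNN explanation step: given embeddings of text-to-image generated prototypes and the natural-language concept each was generated from, a query is explained by its nearest prototypes. The cosine similarity and the voting scheme are illustrative choices, not CoProNN's exact routine.

```python
import numpy as np

def knn_prototype_explanation(query_emb, proto_embs, proto_concepts, k=5):
    """Explain a prediction by its k nearest concept prototypes in embedding space.
    proto_embs: (num_prototypes, d) embeddings of generated prototypes (L2-normalized);
    proto_concepts: the concept string each prototype was generated from."""
    sims = proto_embs @ query_emb                 # cosine similarity to every prototype
    top = np.argsort(-sims)[:k]
    votes = {}
    for i in top:
        votes[proto_concepts[i]] = votes.get(proto_concepts[i], 0) + 1
    ranked_concepts = sorted(votes.items(), key=lambda kv: -kv[1])
    return ranked_concepts, top                   # concepts by vote count, prototype indices
```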
https://arxiv.org/abs/2404.14830
This paper outlines our submission to the MEDIQA2024 Multilingual and Multimodal Medical Answer Generation (M3G) shared task. We report results for two standalone solutions under the English category of the task, the first involving two consecutive API calls to the Claude 3 Opus API and the second involving training an image-disease label joint embedding in the style of CLIP for image classification. These two solutions scored 1st and 2nd place respectively on the competition leaderboard, substantially outperforming the next best solution. Additionally, we discuss insights gained from post-competition experiments. While the performance of these two solutions has significant room for improvement due to the difficulty of the shared task and the challenging nature of medical visual question answering in general, we identify the multi-stage LLM approach and the CLIP image classification approach as promising avenues for further investigation.
https://arxiv.org/abs/2404.14567
In this paper, we present a simple yet effective contrastive knowledge distillation approach, which can be formulated as a sample-wise alignment problem with intra- and inter-sample constraints. Unlike traditional knowledge distillation methods that concentrate on maximizing feature similarities or preserving class-wise semantic correlations between teacher and student features, our method attempts to recover the "dark knowledge" by aligning sample-wise teacher and student logits. Specifically, our method first minimizes logit differences within the same sample by considering their numerical values, thus preserving intra-sample similarities. Next, we bridge semantic disparities by leveraging dissimilarities across different samples. Note that constraints on intra-sample similarities and inter-sample dissimilarities can be efficiently and effectively reformulated into a contrastive learning framework with newly designed positive and negative pairs. The positive pair consists of the teacher's and student's logits derived from an identical sample, while the negative pairs are formed by using logits from different samples. With this formulation, our method benefits from the simplicity and efficiency of contrastive learning through the optimization of InfoNCE, yielding a run-time complexity that is far less than $O(n^2)$, where $n$ represents the total number of training samples. Furthermore, our method can eliminate the need for hyperparameter tuning, particularly related to temperature parameters and large batch sizes. We conduct comprehensive experiments on three datasets including CIFAR-100, ImageNet-1K, and MS COCO. Experimental results clearly confirm the effectiveness of the proposed method on both image classification and object detection tasks. Our source codes will be publicly available at this https URL.
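A compact sketch of the sample-wise contrastive formulation: within a batch, the positive pair is the teacher/student logit vectors of the same sample and the negatives come from other samples, optimized with an InfoNCE-style cross-entropy. Normalizing the logits and omitting a temperature are assumptions consistent with, but not verified against, the paper.

```python
import torch
import torch.nn.functional as F

def sample_wise_contrastive_kd(student_logits, teacher_logits):
    """InfoNCE-style distillation: a student's logit vector should match the teacher's
    logit vector for the same sample (positive) and differ from other samples' (negatives)."""
    s = F.normalize(student_logits, dim=1)              # (B, C)
    t = F.normalize(teacher_logits.detach(), dim=1)     # (B, C), teacher is frozen
    sim = s @ t.t()                                      # (B, B) pairwise similarities
    targets = torch.arange(s.size(0), device=s.device)  # diagonal entries are the positives
    return F.cross_entropy(sim, targets)
```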
https://arxiv.org/abs/2404.14109
Image classification is a fundamental task in computer vision, and the quest to enhance DNN accuracy without inflating model size or latency remains a pressing concern. We make a couple of advances in this regard, leading to a novel EncodeNet design and training framework. The first advancement involves Converting Autoencoders, a novel approach that transforms an image into an easy-to-classify image of its class. Our prior work, which applied the Converting Autoencoder and a simple classifier in tandem, achieved moderate accuracy on simple datasets such as MNIST and FMNIST. However, on more complex datasets like CIFAR-10, the Converting Autoencoder has a large reconstruction loss, making it unsuitable for enhancing DNN accuracy. To address these limitations, we generalize the design of Converting Autoencoders by leveraging a larger class of DNNs, those with architectures comprising feature extraction layers followed by classification layers. We incorporate a generalized algorithmic design of the Converting Autoencoder and intra-class clustering to identify representative images, leading to optimized image feature learning. Next, we demonstrate the effectiveness of our EncodeNet design and training framework, improving the accuracy of well-trained baseline DNNs while maintaining the overall model size. EncodeNet's building blocks comprise the trained encoder from our generalized Converting Autoencoders transferring knowledge to a lightweight classifier network, which is also extracted from the baseline DNN. Our experimental results demonstrate that EncodeNet improves the accuracy of VGG16 from 92.64% to 94.05% on CIFAR-10 and of ResNet20 from 74.56% to 76.04% on CIFAR-100. It outperforms state-of-the-art techniques that rely on knowledge distillation and attention mechanisms, delivering higher accuracy for models of comparable size.
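A hedged sketch of the intra-class clustering step used to pick representative images that a Converting Autoencoder could then reconstruct toward; the choice of k-means and the nearest-to-centroid selection rule are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_targets(features, labels, n_clusters=3):
    """Intra-class clustering to pick a representative image per cluster of each class."""
    targets = {}
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        k = min(n_clusters, len(idx))
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features[idx])
        reps = {}
        for c in range(k):
            members = idx[km.labels_ == c]
            dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
            reps[c] = int(members[dists.argmin()])   # index of the image nearest the centroid
        targets[cls] = reps
    return targets
```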
https://arxiv.org/abs/2404.13770
Online task-free continual learning (OTFCL) is a more challenging variant of continual learning which emphasizes the gradual shift of task boundaries and learns in an online mode. Existing methods rely on a memory buffer composed of old samples to prevent forgetting. However, the use of memory buffers not only raises privacy concerns but also hinders the efficient learning of new samples. To address this problem, we propose a novel framework called I2CANSAY that gets rid of the dependence on memory buffers and efficiently learns the knowledge of new data from one-shot samples. Concretely, our framework comprises two main modules. First, the Inter-Class Analogical Augmentation (ICAN) module generates diverse pseudo-features for old classes based on the inter-class analogy of feature distributions for different new classes, serving as a substitute for the memory buffer. Second, the Intra-Class Significance Analysis (ISAY) module analyzes the significance of attributes for each class via its distribution standard deviation, and generates the importance vector as a correction bias for the linear classifier, thereby enhancing the capability of learning from new samples. We run our experiments on four popular image classification datasets: CoRe50, CIFAR-10, CIFAR-100, and CUB-200; our approach outperforms the prior state-of-the-art by a large margin.
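One possible reading of the ISAY module, sketched below: attributes with low within-class standard deviation are treated as more significant, and the normalized importance vector can then bias a linear classifier's scores. This is an interpretation of the abstract, not the paper's exact formula.

```python
import numpy as np

def isay_importance(features_by_class, eps=1e-6):
    """Per-class importance vectors from within-class standard deviation (assumed rule:
    lower variance means a more significant attribute), usable as a classifier correction bias."""
    importance = {}
    for cls, feats in features_by_class.items():      # feats: (n_samples, d) feature matrix
        inv_std = 1.0 / (feats.std(axis=0) + eps)
        importance[cls] = inv_std / inv_std.sum()     # normalized importance vector
    return importance
```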
https://arxiv.org/abs/2404.13576
The Transformer has been applied in the field of computer vision due to its excellent performance in natural language processing, surpassing traditional convolutional neural networks and achieving new state-of-the-art results. ViT divides an image into several local patches, known as "visual sentences". However, the information contained in the image is vast and complex, and focusing only on the features at the "visual sentence" level is not enough; the features between local patches should also be taken into consideration. To achieve further improvement, the TNT model was proposed, whose algorithm further divides the image into smaller patches, namely "visual words," achieving more accurate results. The core of the Transformer is the Multi-Head Attention mechanism, yet traditional attention mechanisms ignore interactions across different attention heads. In order to reduce redundancy and improve utilization, we introduce a nested algorithm and apply the resulting Nested-TNT to image classification tasks. The experiments confirm that the proposed model achieves better classification performance than ViT and TNT, exceeding them by 2.25% and 1.1% on the CIFAR10 dataset and by 2.78% and 0.25% on the FLOWERS102 dataset, respectively.
https://arxiv.org/abs/2404.13434
In recent years, Vision Transformers (ViTs) have shown promising classification performance over Convolutional Neural Networks (CNNs) due to their self-attention mechanism. Many researchers have incorporated ViTs for Hyperspectral Image (HSI) classification. HSIs are characterised by narrow contiguous spectral bands, providing rich spectral data. Although ViTs excel with sequential data, they cannot extract spectral-spatial information like CNNs. Furthermore, to have high classification performance, there should be a strong interaction between the HSI token and the class (CLS) token. To solve these issues, we propose a 3D-Convolution guided Spectral-Spatial Transformer (3D-ConvSST) for HSI classification that utilizes a 3D-Convolution Guided Residual Module (CGRM) in-between encoders to "fuse" the local spatial and spectral information and to enhance the feature propagation. Furthermore, we forego the class token and instead apply Global Average Pooling, which effectively encodes more discriminative and pertinent high-level features for classification. Extensive experiments have been conducted on three public HSI datasets to show the superiority of the proposed model over state-of-the-art traditional, convolutional, and Transformer models. The code is available at this https URL.
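A small sketch of the class-token-free head described above: global average pooling over the output tokens followed by a linear classifier. The LayerNorm placement and layer names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GAPClassificationHead(nn.Module):
    """Classification head that averages all output tokens (global average pooling)
    instead of reading a dedicated CLS token."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):            # tokens: (B, N, dim), no CLS token prepended
        pooled = self.norm(tokens).mean(dim=1)
        return self.fc(pooled)
```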
https://arxiv.org/abs/2404.13252