Addressing the challenges of rare diseases is difficult, especially given the limited number of reference images and the small patient population. This is most evident in rare skin diseases, where long-tailed data distributions make it difficult to develop unbiased and broadly effective models. The diverse ways in which image datasets are gathered and their distinct purposes add to these challenges. Our study conducts a detailed examination of the benefits and drawbacks of episodic and conventional training methodologies, adopting a few-shot learning approach alongside transfer learning. We evaluated our models on the ISIC2018, Derm7pt, and SD-198 datasets. With minimal labeled examples, our models showed substantial information gains and better performance than previously trained models. Our research highlights the improved feature-representation ability of DenseNet121 and MobileNetV2 models, achieved by using models pre-trained on ImageNet to increase intra-class similarity. Moreover, our experiments, ranging from 2-way to 5-way classification with up to 10 examples, showed a growing success rate for traditional transfer learning as the number of examples increased. Adding data augmentation techniques significantly improved the performance of our transfer-learning-based model, surpassing existing methods, especially on the SD-198 and ISIC2018 datasets. All source code related to this work will be made publicly available soon at the provided URL.
https://arxiv.org/abs/2404.16814
Representation-based Siamese networks have risen in popularity in lightweight text matching due to their low deployment and inference costs. While word-level attention mechanisms have been implemented within Siamese networks to improve performance, we propose Feature Attention (FA), a novel downstream block designed to enrich the modeling of dependencies among embedding features. Employing "squeeze-and-excitation" techniques, the FA block dynamically adjusts the emphasis on individual features, enabling the network to concentrate on the features that contribute most to the final classification. Building upon FA, we introduce a dynamic "selection" mechanism called Selective Feature Attention (SFA), which leverages a stacked BiGRU Inception structure. The SFA block facilitates multi-scale semantic extraction by traversing different stacked BiGRU layers, encouraging the network to selectively concentrate on semantic information and embedding features across varying levels of abstraction. Both the FA and SFA blocks integrate seamlessly with various Siamese networks, showcasing a plug-and-play characteristic. Experimental evaluations conducted across diverse text matching baselines and benchmarks underscore the indispensability of modeling feature attention and the superiority of the "selection" mechanism.
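As a rough illustration of the "squeeze-and-excitation" idea the FA block builds on, here is a plain-Python sketch: squeeze a per-feature statistic across the sequence, pass it through two small dense layers to get one sigmoid gate per embedding feature, then rescale every token vector feature-wise. The dense weights `w1`/`w2` and the reduction ratio are hypothetical stand-ins, not the paper's parameters.

```python
import math

def fa_block(embeddings, w1, w2, reduction=2):
    """Sketch of a squeeze-and-excitation-style Feature Attention block.

    embeddings: list of token vectors (each a list of d floats).
    w1: (d//reduction, d) weights for the squeeze->hidden layer.
    w2: (d, d//reduction) weights for the hidden->gate layer.
    """
    d = len(embeddings[0])
    # Squeeze: mean-pool each embedding feature across all tokens
    squeezed = [sum(tok[i] for tok in embeddings) / len(embeddings)
                for i in range(d)]
    # Excitation: dense -> ReLU -> dense -> sigmoid, one gate per feature
    hidden = [max(0.0, sum(w1[j][i] * squeezed[i] for i in range(d)))
              for j in range(d // reduction)]
    gates = [1.0 / (1.0 + math.exp(-sum(w2[i][j] * hidden[j]
                                        for j in range(d // reduction))))
             for i in range(d)]
    # Re-weight every token's features by the shared gates
    return [[tok[i] * gates[i] for i in range(d)] for tok in embeddings]
```

Because each gate lies in (0, 1), the block can only attenuate features, letting the network emphasize the ones that survive with larger magnitudes.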
https://arxiv.org/abs/2404.16776
Detection of malignant lesions in mammography images is extremely important for early breast cancer diagnosis. In clinical practice, images are acquired from two different angles, and radiologists can fully utilize information from both views, simultaneously locating the same lesion. However, for automatic detection approaches, such information fusion remains a challenge. In this paper, we propose a new model called MAMM-Net, which processes both mammography views simultaneously by sharing information not only at the object level, as in existing works, but also at the feature level. MAMM-Net's key component is the Fusion Layer, based on deformable attention and designed to increase detection precision while keeping recall high. Our experiments show superior performance on the public DDSM dataset compared to the previous state-of-the-art model, while introducing new helpful features such as pixel-level lesion annotation and classification of lesion malignancy.
https://arxiv.org/abs/2404.16718
Vision-language models enable open-world classification of objects without the need for any retraining. While this zero-shot paradigm marks a significant advance, even today's best models exhibit skewed performance when objects are dissimilar from their typical depiction. Real world objects such as pears appear in a variety of forms -- from diced to whole, on a table or in a bowl -- yet standard VLM classifiers map all instances of a class to a \textit{single vector based on the class label}. We argue that to represent this rich diversity within a class, zero-shot classification should move beyond a single vector. We propose a method to encode and account for diversity within a class using inferred attributes, still in the zero-shot setting without retraining. We find our method consistently outperforms standard zero-shot classification over a large suite of datasets encompassing hierarchies, diverse object states, and real-world geographic diversity, as well as on finer-grained datasets where intra-class diversity may be less prevalent. Importantly, our method is inherently interpretable, offering faithful explanations for each inference to facilitate model debugging and enhance transparency. We also find our method scales efficiently to a large number of attributes to account for diversity -- leading to more accurate predictions for atypical instances. Finally, we characterize a principled trade-off between overall and worst class accuracy, which can be tuned via a hyperparameter of our method. We hope this work spurs further research into the promise of zero-shot classification beyond a single class vector for capturing diversity in the world, and building transparent AI systems without compromising performance.
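The move from one class vector to several attribute vectors per class can be sketched in a few lines: score each class by aggregating an image embedding's similarities to all of that class's attribute embeddings instead of to a single label embedding. The toy embeddings and the `max` aggregation below are illustrative assumptions, not the paper's exact formulation.

```python
def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den

def classify_multi_vector(image_emb, class_attr_embs, aggregate=max):
    """Zero-shot classification with several attribute vectors per class.

    class_attr_embs: {class_name: [attribute embeddings]}.
    Each class is scored by aggregating the image's similarity to all of
    its attribute vectors, so atypical instances (e.g. "diced pear") can
    match a non-canonical attribute vector.
    """
    scores = {cls: aggregate(cosine(image_emb, a) for a in attrs)
              for cls, attrs in class_attr_embs.items()}
    return max(scores, key=scores.get), scores
```

The winning attribute also yields an interpretable explanation of why the class was chosen.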
https://arxiv.org/abs/2404.16717
Visual Instruction Tuning represents a novel learning paradigm involving the fine-tuning of pre-trained language models using task-specific instructions. This paradigm shows promising zero-shot results in various natural language processing tasks but is still unexplored in vision emotion understanding. In this work, we focus on enhancing the model's proficiency in understanding and adhering to instructions related to emotional contexts. Initially, we identify key visual clues critical to visual emotion recognition. Subsequently, we introduce a novel GPT-assisted pipeline for generating emotion visual instruction data, effectively addressing the scarcity of annotated instruction data in this domain. Expanding on the groundwork established by InstructBLIP, our proposed EmoVIT architecture incorporates emotion-specific instruction data, leveraging the powerful capabilities of Large Language Models to enhance performance. Through extensive experiments, our model showcases its proficiency in emotion classification, adeptness in affective reasoning, and competence in comprehending humor. The comparative analysis provides a robust benchmark for Emotion Visual Instruction Tuning in the era of LLMs, providing valuable insights and opening avenues for future exploration in this domain. Our code is available at \url{this https URL}.
https://arxiv.org/abs/2404.16670
Linguistic ambiguity continues to represent a significant challenge for natural language processing (NLP) systems, notwithstanding advancements in architectures such as Transformers and BERT. Inspired by the recent success of instructional models like ChatGPT and Gemini (known as Bard in 2023), this study aims to analyze and discuss linguistic ambiguity within these models, focusing on three types prevalent in Brazilian Portuguese: semantic, syntactic, and lexical ambiguity. We create a corpus comprising 120 sentences, both ambiguous and unambiguous, for classification, explanation, and disambiguation. The models' capability to generate ambiguous sentences was also explored by soliciting sets of sentences for each type of ambiguity. The results underwent qualitative analysis, drawing on recognized linguistic references, and quantitative assessment based on the accuracy of the responses obtained. It was evidenced that even the most sophisticated models, such as ChatGPT and Gemini, exhibit errors and deficiencies in their responses, with explanations that are often inconsistent. Furthermore, accuracy peaked at 49.58 percent, indicating the need for descriptive studies for supervised learning.
https://arxiv.org/abs/2404.16653
Multi-modal foundation models such as CLIP have showcased impressive zero-shot capabilities. However, their applicability in resource-constrained environments is limited due to their large number of parameters and high inference time. While existing approaches have scaled down the entire CLIP architecture, we focus on training smaller variants of the image encoder, which suffices for efficient zero-shot classification. The use of synthetic data has shown promise in distilling representations from larger teachers, resulting in strong few-shot and linear probe performance. However, we find that this approach surprisingly fails in true zero-shot settings when using contrastive losses. We identify the exploitation of spurious features as being responsible for poor generalization between synthetic and real data. However, by using an image feature-based L2 distillation loss, we mitigate these problems and train students whose zero-shot performance on four domain-specific datasets is on par with a ViT-B/32 teacher model trained on DataCompXL, while featuring up to 92% fewer parameters.
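The feature-based L2 distillation objective mentioned here is simple enough to state directly: instead of a contrastive loss, the student is trained to minimize the mean squared distance between its image embeddings and the teacher's. A minimal sketch (batch represented as plain lists; any projection to a shared embedding size is assumed to have happened already):

```python
def l2_distillation_loss(student_feats, teacher_feats):
    """Mean squared L2 distance between student and teacher image
    embeddings over a batch -- no text tower or contrastive term."""
    assert len(student_feats) == len(teacher_feats)
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        total += sum((si - ti) ** 2 for si, ti in zip(s, t))
    return total / len(student_feats)
```

Because the student mimics the teacher's embedding geometry directly, the teacher's text encoder can be reused unchanged at zero-shot inference time.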
https://arxiv.org/abs/2404.16637
Unsupervised cross-lingual transfer involves transferring knowledge between languages without explicit supervision. Although numerous studies have been conducted to improve performance in such tasks by focusing on cross-lingual knowledge, particularly lexical and syntactic knowledge, current approaches are limited as they incorporate only syntactic or lexical information. Since each type of information offers unique advantages and no previous attempts have combined both, we explore the potential of this approach. In this paper, we present a novel framework called "Lexicon-Syntax Enhanced Multilingual BERT" that combines both lexical and syntactic knowledge. Specifically, we use Multilingual BERT (mBERT) as the base model and employ two techniques to enhance its learning capabilities. The code-switching technique is used to implicitly teach the model lexical alignment information, while a syntactic-based graph attention network is designed to help the model encode syntactic structure. To integrate both types of knowledge, we input code-switched sequences into both the syntactic module and the mBERT base model simultaneously. Our extensive experimental results demonstrate that this framework consistently outperforms all baselines for zero-shot cross-lingual transfer, with gains of 1.0–3.7 points on text classification, named entity recognition (NER), and semantic parsing tasks. Keywords: cross-lingual transfer, lexicon, syntax, code-switching, graph attention network
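The code-switching technique described above amounts to replacing some source-language tokens with their translations from a bilingual lexicon, so the model sees aligned words in the same context. A minimal sketch (the tiny lexicon and the caller-supplied switch positions are illustrative; real pipelines typically sample positions randomly per batch):

```python
def code_switch(tokens, bilingual_lexicon, switch_positions):
    """Return a copy of `tokens` with the tokens at `switch_positions`
    replaced by their lexicon translations, when available.

    Words the lexicon does not cover are left untouched, so the output
    is always a valid mixed-language sequence of the same length.
    """
    out = list(tokens)
    for i in switch_positions:
        word = tokens[i].lower()
        if word in bilingual_lexicon:
            out[i] = bilingual_lexicon[word]
    return out
```

Training on such mixed sequences implicitly teaches the encoder that the original word and its translation should receive similar representations.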
https://arxiv.org/abs/2404.16627
In recent years, the rapid development of computer information technology has accelerated progress in artificial intelligence. Traditional geometry recognition techniques are comparatively dated and achieve low recognition rates. Faced with massive information databases, traditional algorithmic models inevitably suffer from low recognition accuracy and poor performance. Deep learning theory has gradually become a very important part of machine learning, and convolutional neural networks (CNNs) reduce the difficulty of graphics-generation algorithms. In this paper, exploiting the weight-sharing, feature-extraction, and classification advantages of the LeNet-5 architecture, the proposed geometric pattern recognition model trains faster on the training data set. By constructing shared feature parameters for the model and using the cross-entropy loss function during recognition, we improve the model's generalization and its average recognition accuracy on the test data set.
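The cross-entropy loss used here is standard: softmax the network's logits into class probabilities, then take the negative log-probability of the true class. A self-contained sketch for a single sample:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy loss for one sample: -log p(true class)."""
    return -math.log(softmax(logits)[label])
```

The loss shrinks as the model assigns more probability to the correct class, which is what drives the reported improvement in generalization and test accuracy.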
https://arxiv.org/abs/2404.16561
Beyond improving trust and validating model fairness, xAI practices also have the potential to recover valuable scientific insights in application domains where little to no prior human intuition exists. To that end, we propose a method to extract global concept explanations from the predictions of graph neural networks to develop a deeper understanding of the tasks' underlying structure-property relationships. We identify concept explanations as dense clusters in the self-explaining MEGAN model's subgraph latent space. For each concept, we optimize a representative prototype graph and optionally use GPT-4 to provide hypotheses about why each structure has a certain effect on the prediction. We conduct computational experiments on synthetic and real-world graph property prediction tasks. For the synthetic tasks we find that our method correctly reproduces the structural rules by which they were created. For real-world molecular property regression and classification tasks, we find that our method rediscovers established rules of thumb. More specifically, our results for molecular mutagenicity prediction indicate more fine-grained resolution of structural details than existing explainability methods, consistent with previous results from the chemistry literature. Overall, our results show promising capability to extract the underlying structure-property relationships for complex graph property prediction tasks.
https://arxiv.org/abs/2404.16532
In deep learning applications, robustness measures the ability of neural models to handle slight changes in input data; a lack of robustness can lead to safety hazards, especially in safety-critical applications. Pre-deployment assessment of model robustness is essential, but existing methods often suffer from either high costs or imprecise results. To enhance safety in real-world scenarios, metrics that effectively capture the model's robustness are needed. To address this issue, we compare the rigour and usage conditions of various assessment methods based on different definitions. We then propose a straightforward and practical metric utilizing hypothesis testing for probabilistic robustness and integrate it into the TorchAttacks library. Through a comparative analysis of diverse robustness assessment methods, our approach contributes to a deeper understanding of model robustness in safety-critical applications.
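To make "hypothesis testing for probabilistic robustness" concrete, one common sampling-based scheme (a sketch under assumptions, not the paper's exact metric) tests H0: "the misclassification probability under random perturbation is at least eps". If all of n sampled perturbations are classified correctly and (1 - eps)^n falls below the significance level alpha, H0 is rejected and the input is declared probabilistically robust:

```python
import random

def robustness_hypothesis_test(model, x, y, n_samples, eps, alpha,
                               noise=0.1, rng=None):
    """Sampling-based test of H0: P(misclassification) >= eps.

    Draws n_samples uniform perturbations of x. With zero observed
    failures, the probability of that outcome under H0 is at most
    (1 - eps)**n_samples; if that bound is <= alpha, reject H0.
    Returns (reject, observed_failures).
    """
    rng = rng or random.Random(0)
    failures = 0
    for _ in range(n_samples):
        x_pert = [xi + rng.uniform(-noise, noise) for xi in x]
        if model(x_pert) != y:
            failures += 1
    reject = failures == 0 and (1 - eps) ** n_samples <= alpha
    return reject, failures
```

The sample size n directly controls the strength of the certificate: with eps = 0.05 and alpha = 0.05, roughly 59 clean samples already suffice since 0.95^59 < 0.05.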
https://arxiv.org/abs/2404.16457
The interactions between tumor cells and the tumor microenvironment (TME) dictate therapeutic efficacy of radiation and many systemic therapies in breast cancer. However, to date, there is not a widely available method to reproducibly measure tumor and immune phenotypes for each patient's tumor. Given this unmet clinical need, we applied multiple instance learning (MIL) algorithms to assess activity of ten biologically relevant pathways from the hematoxylin and eosin (H&E) slide of primary breast tumors. We employed different feature extraction approaches and state-of-the-art model architectures. Using binary classification, our models attained area under the receiver operating characteristic curve (AUROC) scores above 0.70 for nearly all gene expression pathways and in some cases exceeded 0.80. Attention maps suggest that our trained models recognize biologically relevant spatial patterns of cell sub-populations from H&E. These efforts represent a first step towards developing computational H&E biomarkers that reflect facets of the TME and hold promise for augmenting precision oncology.
https://arxiv.org/abs/2404.16397
Deep convolutional neural networks (DCNNs) are a class of artificial neural networks used primarily for computer vision tasks such as segmentation and classification. Many nonlinear operations, such as activation functions and pooling strategies, are used in DCNNs to enhance their ability to process different signals for different tasks. Conventional convolution, a linear filter, is the essential component of DCNNs, while nonlinear convolution is generally implemented as higher-order Volterra filters. However, the significant memory and computational costs of Volterra filtering are the primary limitation to its widespread use in DCNN applications. In this study, we propose a novel method to perform higher-order Volterra filtering with lower memory and computation cost in the forward and backward passes of DCNN training. The proposed method demonstrates computational advantages compared with conventional Volterra filter implementations. Furthermore, based on the proposed method, a new attention module called the Higher-order Local Attention Block (HLA) is proposed and tested on the CIFAR-100 dataset, showing competitive improvement on the classification task. Source code is available at: this https URL
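For reference, a second-order Volterra filter extends linear convolution with a quadratic term over pairs of input samples: y[n] = Σ_i h1[i]·x[n-i] + Σ_i Σ_j h2[i][j]·x[n-i]·x[n-j]. The naive 1D sketch below (zero-padded borders, illustrative kernels) shows why costs explode: the quadratic kernel alone has M² coefficients per output, which is exactly the overhead the paper's method targets.

```python
def volterra_2nd_order(x, h1, h2):
    """Naive second-order Volterra filter over a 1D signal.

    x:  input signal (list of floats), zero-padded at the left border.
    h1: linear kernel of length M.
    h2: MxM quadratic kernel over pairs of delayed samples.
    """
    M = len(h1)
    y = []
    for n in range(len(x)):
        window = [x[n - i] if n - i >= 0 else 0.0 for i in range(M)]
        linear = sum(h1[i] * window[i] for i in range(M))
        quadratic = sum(h2[i][j] * window[i] * window[j]
                        for i in range(M) for j in range(M))
        y.append(linear + quadratic)
    return y
```

With h2 set to all zeros this reduces to ordinary (linear) convolution, which makes the added cost of the nonlinear term easy to see.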
https://arxiv.org/abs/2404.16380
Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Recently, unsupervised prompt tuning methods, such as UPL and POUF, directly leverage pseudo-labels as supervisory information to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo-labels easily misguide the tuning process and result in poor representation capabilities. In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner. Specifically, we integrate both instance confidence and prototype scores to select representative samples, which are used to customize a reliable Feature Cache Model (FCM) for training-free inference. Then, we design a Multi-level Similarity Measure (MSM) that considers both feature-level and semantic-level similarities to calculate the distance between each test image and the cached samples as the weights of the corresponding cached labels, generating similarity-based prediction probabilities. In this way, TFUP achieves surprising performance, even surpassing training-based methods on multiple classification datasets. Based on our TFUP, we propose a training-based approach (TFUP-T) to further boost the adaptation performance. In addition to the standard cross-entropy loss, TFUP-T adopts an additional marginal distribution entropy loss to constrain the model from a global perspective. Our TFUP-T achieves new state-of-the-art classification performance compared to unsupervised and few-shot adaptation approaches on multiple benchmarks. In particular, TFUP-T improves the classification accuracy of POUF by 3.3% on the most challenging Domain-Net dataset.
https://arxiv.org/abs/2404.16339
Model Weight Averaging (MWA) is a technique that seeks to enhance a model's performance by averaging the weights of multiple trained models. This paper first empirically finds that 1) vanilla MWA can benefit class-imbalanced learning, and 2) performing model averaging in the early epochs of training yields a greater performance improvement than doing so in later epochs. Inspired by these two observations, in this paper we propose a novel MWA technique for class-imbalanced learning tasks named Iterative Model Weight Averaging (IMWA). Specifically, IMWA divides the entire training stage into multiple episodes. Within each episode, multiple models are concurrently trained from the same initialized model weights and subsequently averaged into a single model. The weights of this averaged model then serve as a fresh initialization for the ensuing episode, establishing an iterative learning paradigm. Compared to vanilla MWA, IMWA achieves higher performance improvements at the same computational cost. Moreover, IMWA can further enhance the performance of methods employing an EMA strategy, demonstrating that IMWA and EMA can complement each other. Extensive experiments on various class-imbalanced learning tasks, i.e., class-imbalanced image classification, semi-supervised class-imbalanced image classification, and semi-supervised object detection, showcase the effectiveness of our IMWA.
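The episode loop described here is compact enough to sketch directly: within each episode several models train from the same initialization, their weights are averaged, and the average seeds the next episode. Weights are modeled as plain dicts and `train_fn` is a hypothetical stand-in for one episode of training:

```python
def average_weights(models):
    """Element-wise average of several weight dicts with identical keys."""
    keys = models[0].keys()
    return {k: sum(m[k] for m in models) / len(models) for k in keys}

def imwa(init_weights, train_fn, num_episodes, num_models):
    """Iterative Model Weight Averaging.

    Each episode: train `num_models` copies from the same initialization
    (train_fn(weights, seed) -> new weight dict), average them, and use
    the average as the next episode's initialization.
    """
    weights = dict(init_weights)
    for _episode in range(num_episodes):
        trained = [train_fn(weights, seed) for seed in range(num_models)]
        weights = average_weights(trained)
    return weights
```

Vanilla MWA corresponds to the special case `num_episodes=1`; the iteration is what lets the early-epoch averaging benefit compound.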
https://arxiv.org/abs/2404.16331
Foundation models contain a wealth of information from their vast number of training samples. However, most prior work fails to extract this information precisely and efficiently for small sample sizes. In this work, we propose a framework that uses reinforcement learning as a controller for foundation models, allowing the granular generation of small, focused synthetic support sets to augment the performance of neural network models on real data classification tasks. We first give a reinforcement learning agent access to a novel context-based dictionary; the agent then uses this dictionary with a novel prompt structure to form and optimize prompts as inputs to generative models, receiving feedback based on a reward function combining the change in validation accuracy and entropy. A support set is formed this way over several exploration steps. Our framework produced excellent results, increasing classification accuracy by significant margins at no additional labelling or data cost.
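One plausible reading of a reward "combining the change in validation accuracy and entropy" is a weighted sum of the accuracy gain and the drop in mean prediction entropy after adding a candidate synthetic support set. The abstract does not give the exact combination, so the additive form and the `ent_weight` coefficient below are assumptions for illustration:

```python
import math

def prediction_entropy(prob_dists):
    """Mean Shannon entropy (nats) of a batch of predicted distributions."""
    total = 0.0
    for p in prob_dists:
        total += -sum(pi * math.log(pi) for pi in p if pi > 0)
    return total / len(prob_dists)

def support_set_reward(acc_before, acc_after, ent_before, ent_after,
                       ent_weight=0.5):
    """Hypothetical RL reward: accuracy improvement plus a weighted
    reduction in prediction entropy on the validation set."""
    return (acc_after - acc_before) + ent_weight * (ent_before - ent_after)
```

A positive reward then signals that the generated support set both improved validation accuracy and made the classifier more confident.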
https://arxiv.org/abs/2404.16300
Pooling layers (e.g., max and average) may overlook important information encoded in the spatial arrangement of pixel intensity and/or feature values. We propose a novel lacunarity pooling layer that aims to capture the spatial heterogeneity of the feature maps by evaluating the variability within local windows. The layer operates at multiple scales, allowing the network to adaptively learn hierarchical features. The lacunarity pooling layer can be seamlessly integrated into any artificial neural network architecture. Experimental results demonstrate the layer's effectiveness in capturing intricate spatial patterns, leading to improved feature extraction capabilities. The proposed approach holds promise in various domains, especially in agricultural image analysis tasks. This work contributes to the evolving landscape of artificial neural network architectures by introducing a novel pooling layer that enriches the representation of spatial features. Our code is publicly available.
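Lacunarity is a classical measure of spatial heterogeneity, often defined for a window as the second moment over the squared first moment, E[X²]/E[X]²; that particular formula is an assumption here, since the abstract does not spell out the variant used. A minimal single-scale sketch of pooling a 2D feature map by lacunarity instead of max or average:

```python
def lacunarity(values, eps=1e-8):
    """Lacunarity of one window: E[X^2] / (E[X])^2.

    Equals 1.0 for a constant window and grows with heterogeneity,
    so it captures variability that max/average pooling discard.
    """
    n = len(values)
    mean = sum(values) / n
    second_moment = sum(v * v for v in values) / n
    return second_moment / (mean * mean + eps)

def lacunarity_pool(fmap, window=2):
    """Pool a 2D feature map over non-overlapping windows, replacing
    each window by its lacunarity rather than its max or average."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for r in range(0, h - window + 1, window):
        row = []
        for c in range(0, w - window + 1, window):
            patch = [fmap[r + i][c + j]
                     for i in range(window) for j in range(window)]
            row.append(lacunarity(patch))
        out.append(row)
    return out
```

Running this at several window sizes gives the multi-scale behavior the abstract describes.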
https://arxiv.org/abs/2404.16268
AutoGluon-Multimodal (AutoMM) is introduced as an open-source AutoML library designed specifically for multimodal learning. Distinguished by its exceptional ease of use, AutoMM enables fine-tuning of foundational models with just three lines of code. Supporting various modalities including image, text, and tabular data, both independently and in combination, the library offers a comprehensive suite of functionalities spanning classification, regression, object detection, semantic matching, and image segmentation. Experiments across diverse datasets and tasks showcase AutoMM's superior performance in basic classification and regression tasks compared to existing AutoML tools, while also demonstrating competitive results in advanced tasks, aligning with specialized toolboxes designed for such purposes.
https://arxiv.org/abs/2404.16233
The global decline in bee populations poses significant risks to agriculture, biodiversity, and environmental stability. To bridge the gap in existing data, we introduce ApisTox, a comprehensive dataset focusing on the toxicity of pesticides to honey bees (Apis mellifera). This dataset combines and leverages data from existing sources such as ECOTOX and PPDB, providing an extensive, consistent, and curated collection that surpasses previous datasets. ApisTox incorporates a wide array of data, including toxicity levels for chemicals, details such as the time of their publication in the literature, and identifiers linking them to external chemical databases. This dataset may serve as an important tool for environmental and agricultural research and can also support the development of policies and practices aimed at minimizing harm to bee populations. Finally, ApisTox offers a unique resource for benchmarking molecular property prediction methods on agrochemical compounds, facilitating advancements in both environmental science and cheminformatics. This makes it a valuable tool for both academic research and practical applications in bee conservation.
https://arxiv.org/abs/2404.16196
The recent prevalence of publicly accessible, large medical imaging datasets has led to a proliferation of artificial intelligence (AI) models for cardiovascular image classification and analysis. At the same time, the potentially significant impacts of these models have motivated the development of a range of explainable AI (XAI) methods that aim to explain model predictions given certain image inputs. However, many of these methods are not developed or evaluated with domain experts, and explanations are not contextualized in terms of medical expertise or domain knowledge. In this paper, we propose a novel framework and python library, MiMICRI, that provides domain-centered counterfactual explanations of cardiovascular image classification models. MiMICRI helps users interactively select and replace segments of medical images that correspond to morphological structures. From the counterfactuals generated, users can then assess the influence of each segment on model predictions, and validate the model against known medical facts. We evaluate this library with two medical experts. Our evaluation demonstrates that a domain-centered XAI approach can enhance the interpretability of model explanations, and help experts reason about models in terms of relevant domain knowledge. However, concerns were also surfaced about the clinical plausibility of the counterfactuals generated. We conclude with a discussion on the generalizability and trustworthiness of the MiMICRI framework, as well as the implications of our findings on the development of domain-centered XAI methods for model interpretability in healthcare contexts.
https://arxiv.org/abs/2404.16174