Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data, such as cross-modal retrieval. The adaptation of CLIP to this specific domain has relied on model fine-tuning with the standard contrastive objective, using existing human-labeled image-caption datasets, or using synthetic data corresponding to image-caption pairs derived from other annotations over remote sensing images (e.g., object classes). The use of different pre-training mechanisms has received less attention, and only a few exceptions have considered multilingual inputs. This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model and testing the use of a self-supervised method, based on aligning local and global representations from individual input images, together with the standard CLIP objective. Model training relied on assembling pre-existing datasets of remote sensing images paired with English captions, followed by automated machine translation into nine additional languages. We show that the translated data is indeed helpful, e.g., also improving performance on English. Our resulting model, named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks, including cross-modal and multilingual image-text retrieval and zero-shot image classification.
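To make the fine-tuning objective concrete, below is a minimal PyTorch sketch of the standard symmetric CLIP contrastive (InfoNCE) loss that the abstract builds on; the encoder outputs, batch size, embedding dimension, and temperature are illustrative assumptions rather than the authors' code, and the local/global self-supervised term is omitted.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/caption embeddings.

    image_features, text_features: (batch, dim) outputs of the two encoders.
    The temperature value is a common default, not the paper's setting.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # The matching caption for image i sits at index i of the batch.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Random embeddings stand in for real encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```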
https://arxiv.org/abs/2410.23370
In many modern computer application problems, the classification of image data plays an important role. Among the many different supervised machine learning models, convolutional neural networks (CNNs) and linear discriminant analysis (LDA), as well as sophisticated variants thereof, are popular techniques. In this work, two different domain-decomposed CNN models are experimentally compared on different image classification problems. Both models are loosely inspired by domain decomposition methods and are additionally combined with a transfer learning strategy. The resulting models show improved classification accuracies compared to the corresponding composed global CNN model without transfer learning and, moreover, help to speed up the training process. In addition, a novel decomposed LDA strategy is proposed which also relies on a localization approach and is combined with a small neural network model. In comparison with a global LDA applied to the entire input data, the presented decomposed LDA approach shows increased classification accuracies for the considered test problems.
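As a loose illustration of the decomposition idea described above, the sketch below splits each input image into four spatial subdomains, processes each with a small local CNN branch, and classifies from the concatenated local features; the branch architecture and tiling are assumptions, and the paper's transfer learning step is omitted.

```python
import torch
import torch.nn as nn

class DecomposedCNN(nn.Module):
    """Four local CNN branches, one per image quadrant, fused by a linear head.

    This is only a conceptual analogue of a domain-decomposed CNN, not the
    architecture from the paper.
    """
    def __init__(self, n_classes, in_channels=3):
        super().__init__()
        def local_branch():
            return nn.Sequential(
                nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.branches = nn.ModuleList([local_branch() for _ in range(4)])
        self.head = nn.Linear(4 * 32, n_classes)

    def forward(self, x):
        h, w = x.shape[-2:]
        # Split the image into its four quadrants (the "subdomains").
        tiles = [x[..., :h // 2, :w // 2], x[..., :h // 2, w // 2:],
                 x[..., h // 2:, :w // 2], x[..., h // 2:, w // 2:]]
        feats = [b(t).flatten(1) for b, t in zip(self.branches, tiles)]
        return self.head(torch.cat(feats, dim=1))

logits = DecomposedCNN(n_classes=10)(torch.randn(2, 3, 32, 32))
```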
https://arxiv.org/abs/2410.23359
Machine unlearning (MU) has gained significant attention as a means to remove specific data from trained models without requiring a full retraining process. While progress has been made in unimodal domains like text and image classification, unlearning in multimodal models remains relatively underexplored. In this work, we address the unique challenges of unlearning in CLIP, a prominent multimodal model that aligns visual and textual representations. We introduce CLIPErase, a novel approach that disentangles and selectively forgets both visual and textual associations, ensuring that unlearning does not compromise model performance. CLIPErase consists of three key modules: a Forgetting Module that disrupts the associations in the forget set, a Retention Module that preserves performance on the retain set, and a Consistency Module that maintains consistency with the original model. Extensive experiments on the CIFAR-100 and Flickr30K datasets across four CLIP downstream tasks demonstrate that CLIPErase effectively forgets designated associations in zero-shot tasks for multimodal samples, while preserving the model's performance on the retain set after unlearning.
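The abstract names three modules (Forgetting, Retention, Consistency) without giving their losses here; the sketch below shows one plausible way such a combined objective could be written in PyTorch, purely as an illustration of the idea and not the authors' formulation. All weights and the choice of cosine/MSE terms are assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_alignment(img_emb, txt_emb):
    """Cosine similarity between paired (normalized) image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    return (img_emb * txt_emb).sum(dim=-1)

def cliperase_style_loss(forget_pairs, retain_pairs, retain_pairs_orig,
                         lam_forget=1.0, lam_retain=1.0, lam_consist=1.0):
    """One hypothetical combination of the three objectives named in the abstract.

    forget_pairs / retain_pairs: (img_emb, txt_emb) from the model being unlearned.
    retain_pairs_orig: retain-set embeddings from the frozen original model.
    """
    f_img, f_txt = forget_pairs
    r_img, r_txt = retain_pairs
    o_img, o_txt = retain_pairs_orig

    # Forgetting: push down the image-text similarity of forget-set pairs.
    loss_forget = cosine_alignment(f_img, f_txt).mean()

    # Retention: keep retain-set pairs well aligned.
    loss_retain = (1.0 - cosine_alignment(r_img, r_txt)).mean()

    # Consistency: stay close to the original model's embeddings on the retain set.
    loss_consist = F.mse_loss(r_img, o_img) + F.mse_loss(r_txt, o_txt)

    return lam_forget * loss_forget + lam_retain * loss_retain + lam_consist * loss_consist
```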
https://arxiv.org/abs/2410.23330
Breast cancer histopathology image classification is crucial for early cancer detection, offering the potential to reduce mortality rates through timely diagnosis. This paper introduces a novel approach integrating Hybrid EfficientNet models with advanced attention mechanisms, including Convolutional Block Attention Module (CBAM), Self-Attention, and Deformable Attention, to enhance feature extraction and focus on critical image regions. We evaluate the performance of our models across multiple magnification scales using publicly available histopathological datasets. Our method achieves significant improvements, with accuracy reaching 98.42% at 400X magnification, surpassing several state-of-the-art models, including VGG and ResNet architectures. The results are validated using metrics such as accuracy, F1-score, precision, and recall, demonstrating the clinical potential of our model in improving diagnostic accuracy. Furthermore, the proposed method shows increased computational efficiency, making it suitable for integration into real-time diagnostic workflows.
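Of the attention mechanisms listed, CBAM is a standard, well-documented block; a compact PyTorch implementation is given below for reference. Its exact placement inside the Hybrid EfficientNet backbone is not specified by the abstract, so the usage comment is only a guess.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention then spatial attention."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP over global average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 7x7 conv over channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        channel_att = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * channel_att

        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.amax(dim=1, keepdim=True)
        spatial_att = torch.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))
        return x * spatial_att

# Hypothetical usage after an intermediate backbone stage:
# feats = CBAM(channels=112)(feats)
```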
https://arxiv.org/abs/2410.22392
Pre-trained vision-language models (VLMs) like CLIP have demonstrated impressive zero-shot performance on a wide range of downstream computer vision tasks. However, there still exists a considerable performance gap between these models and a supervised deep model trained on a downstream dataset. To bridge this gap, we propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation during training. To achieve this, our approach first calibrates the predicted entropy of VLMs and then utilizes a combination of self-uncertainty and neighbor-aware uncertainty to calculate a reliable uncertainty measure for active sample selection. Our extensive experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets, and significantly enhances the zero-shot performance of VLMs.
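A rough sketch of the selection step described above: calibrate the zero-shot probabilities, compute a self-uncertainty (entropy) term and a neighbor-aware term from the k nearest neighbors in embedding space, and pick the highest-scoring samples for annotation. The temperature-based calibration stand-in, the weighting `alpha`, and `k` are placeholders, not the paper's choices.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of each row of an (N, C) probability matrix."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def select_samples(probs, features, budget, k=10, alpha=0.5, temperature=2.0):
    """Pick `budget` unlabeled samples by a calibrated, neighbor-aware uncertainty score.

    probs:    (N, C) zero-shot class probabilities from the VLM.
    features: (N, D) image embeddings used to find neighbors.
    """
    # Crude calibration stand-in: re-soften the predictions with a temperature.
    logits = np.log(probs + 1e-12) / temperature
    calibrated = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    self_unc = entropy(calibrated)

    # Neighbor-aware uncertainty: mean entropy of each sample's k nearest neighbors.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = feats @ feats.T
    np.fill_diagonal(sims, -np.inf)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]
    neighbor_unc = self_unc[nn_idx].mean(axis=1)

    score = alpha * self_unc + (1 - alpha) * neighbor_unc
    return np.argsort(-score)[:budget]   # indices to send for annotation
```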
https://arxiv.org/abs/2410.22187
We propose a novel teacher-student framework to distill knowledge from multiple teachers trained on distinct datasets. Each teacher is first trained from scratch on its own dataset. Then, the teachers are combined into a joint architecture, which fuses the features of all teachers at multiple representation levels. The joint teacher architecture is fine-tuned on samples from all datasets, thus gathering useful generic information from all data samples. Finally, we employ a multi-level feature distillation procedure to transfer the knowledge to a student model for each of the considered datasets. We conduct image classification experiments on seven benchmarks, and action recognition experiments on three benchmarks. To illustrate the power of our feature distillation procedure, the student architectures are chosen to be identical to those of the individual teachers. To demonstrate the flexibility of our approach, we combine teachers with distinct architectures. We show that our novel Multi-Level Feature Distillation (MLFD) can significantly surpass equivalent architectures that are either trained on individual datasets, or jointly trained on all datasets at once. Furthermore, we confirm that each step of the proposed training procedure is well motivated by a comprehensive ablation study. We publicly release our code at this https URL.
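The multi-level feature distillation step can be pictured as matching student and teacher feature maps at several depths; a generic PyTorch version is sketched below, where the per-level projections, the MSE criterion, and the weighting `beta` are assumptions rather than the MLFD recipe.

```python
import torch
import torch.nn.functional as F

def multi_level_distillation_loss(student_feats, teacher_feats, projections,
                                  task_loss, beta=1.0):
    """Distill a joint teacher into a student by matching features at several depths.

    student_feats / teacher_feats: lists of (B, C_l, H_l, W_l) feature maps, one per level.
    projections: per-level modules (e.g. 1x1 convs) mapping student channels to teacher channels.
    task_loss: the usual supervised loss on the student's predictions.
    """
    distill = 0.0
    for s, t, proj in zip(student_feats, teacher_feats, projections):
        s = proj(s)
        # Match spatial resolution before comparing, in case the backbones differ.
        if s.shape[-2:] != t.shape[-2:]:
            s = F.adaptive_avg_pool2d(s, t.shape[-2:])
        distill = distill + F.mse_loss(s, t.detach())
    return task_loss + beta * distill
```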
https://arxiv.org/abs/2410.22184
Recently, Vision Transformers (ViTs) have achieved unprecedented effectiveness in the general domain of image classification. Nonetheless, these models remain underexplored in the field of deepfake detection, given their lower performance compared to Convolutional Neural Networks (CNNs) in that specific context. In this paper, we start by investigating why plain ViT architectures exhibit suboptimal performance when dealing with the detection of facial forgeries. Our analysis reveals that, compared to CNNs, ViTs struggle to model the localized forgery artifacts that typically characterize deepfakes. Based on this observation, we propose a deepfake detection framework called FakeFormer, which extends ViTs to enforce the extraction of subtle, inconsistency-prone information. For that purpose, an explicit attention learning scheme guided by artifact-vulnerable patches and tailored to ViTs is introduced. Extensive experiments are conducted on diverse well-known datasets, including FF++, Celeb-DF, WildDeepfake, DFD, DFDCP, and DFDC. The results show that FakeFormer outperforms the state-of-the-art in terms of generalization and computational cost, without the need for large-scale training datasets. The code is available at \url{this https URL}.
https://arxiv.org/abs/2410.21964
This study investigates the application of Bayesian Optimization (BO) for the hyperparameter tuning of neural networks, specifically targeting the enhancement of Convolutional Neural Networks (CNN) for image classification tasks. Bayesian Optimization is a derivative-free global optimization method suitable for expensive black-box functions with continuous inputs and limited evaluation budgets. The BO algorithm leverages Gaussian Process regression and acquisition functions like Upper Confidence Bound (UCB) and Expected Improvement (EI) to identify optimal configurations effectively. Using the Ax and BOTorch frameworks, this work demonstrates the efficiency of BO in reducing the number of hyperparameter tuning trials while achieving competitive model performance. Experimental outcomes reveal that BO effectively balances exploration and exploitation, converging rapidly towards optimal settings for CNN architectures. This approach underlines the potential of BO in automating neural network tuning, contributing to improved accuracy and computational efficiency in machine learning pipelines.
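The paper relies on the Ax and BOTorch frameworks; for readers who want the bare mechanics, the self-contained sketch below runs a Gaussian-process surrogate with an Expected Improvement acquisition over two toy CNN hyperparameters. The objective is a synthetic stand-in for validation accuracy, and the bounds and trial counts are arbitrary.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(X, gp, y_best, xi=0.01):
    """EI acquisition for a maximization problem."""
    mu, sigma = gp.predict(X, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def validation_accuracy(log_lr, dropout):
    """Placeholder for training a CNN and returning its validation accuracy."""
    return -((log_lr + 3.0) ** 2) - (dropout - 0.3) ** 2 + 0.95

rng = np.random.default_rng(0)
bounds = np.array([[-5.0, -1.0], [0.0, 0.6]])     # log10(lr), dropout rate

# A few random initial trials, then GP-guided trials.
X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(5, 2))
y = np.array([validation_accuracy(*x) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(15):
    gp.fit(X, y)
    candidates = rng.uniform(bounds[:, 0], bounds[:, 1], size=(2000, 2))
    ei = expected_improvement(candidates, gp, y.max())
    x_next = candidates[np.argmax(ei)]
    y_next = validation_accuracy(*x_next)
    X, y = np.vstack([X, x_next]), np.append(y, y_next)

print("best hyperparameters:", X[np.argmax(y)], "score:", y.max())
```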
https://arxiv.org/abs/2410.21886
Trust and interpretability are crucial for the use of Artificial Intelligence (AI) in scientific research, but current models often operate as black boxes, offering limited transparency and justification for their outputs. We introduce AiSciVision, a framework that specializes Large Multimodal Models (LMMs) into interactive research partners and classification models for image classification tasks in niche scientific domains. Our framework uses two key components: (1) Visual Retrieval-Augmented Generation (VisRAG) and (2) domain-specific tools utilized in an agentic workflow. To classify a target image, AiSciVision first retrieves the most similar positive and negative labeled images as context for the LMM. Then the LMM agent actively selects and applies tools to manipulate and inspect the target image over multiple rounds, refining its analysis before making a final prediction. These VisRAG and tooling components are designed to mirror the processes of domain experts, as humans often compare new data to similar examples and use specialized tools to manipulate and inspect images before arriving at a conclusion. Each inference produces both a prediction and a natural language transcript detailing the reasoning and tool usage that led to the prediction. We evaluate AiSciVision on three real-world scientific image classification datasets: detecting the presence of aquaculture ponds, diseased eelgrass, and solar panels. Across these datasets, our method outperforms fully supervised models in both low- and fully-labeled data settings. AiSciVision is actively deployed in real-world use, specifically for aquaculture research, through a dedicated web application that displays the transcripts and allows expert users to converse with them. This work represents a crucial step toward AI systems that are both interpretable and effective, advancing their use in scientific research and scientific discovery.
https://arxiv.org/abs/2410.21480
We present ProtoViT, a method for interpretable image classification combining deep learning and case-based reasoning. This method classifies an image by comparing it to a set of learned prototypes, providing explanations of the form ``this looks like that.'' In our model, a prototype consists of \textit{parts}, which can deform over irregular geometries to create a better comparison between images. Unlike existing models that rely on Convolutional Neural Network (CNN) backbones and spatially rigid prototypes, our model integrates Vision Transformer (ViT) backbones into prototype based models, while offering spatially deformed prototypes that not only accommodate geometric variations of objects but also provide coherent and clear prototypical feature representations with an adaptive number of prototypical parts. Our experiments show that our model can generally achieve higher performance than the existing prototype based models. Our comprehensive analyses ensure that the prototypes are consistent and the interpretations are faithful.
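A stripped-down version of the prototype-part scoring idea is sketched below: each prototype is a set of part vectors, a part's activation is its best cosine match over the ViT patch tokens, and the prototype activations feed a linear classifier. The geometric deformation and adaptive part-count mechanisms of ProtoViT are deliberately omitted, so treat this only as a conceptual sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypePartHead(nn.Module):
    """Score an image by how well learned prototype parts match its ViT patch tokens.

    Simplified: each prototype has `n_parts` part vectors; a part's activation is its
    best cosine match over all patch tokens, and the prototype score is the part sum.
    """
    def __init__(self, dim, n_prototypes, n_parts, n_classes):
        super().__init__()
        self.parts = nn.Parameter(torch.randn(n_prototypes, n_parts, dim))
        self.classifier = nn.Linear(n_prototypes, n_classes, bias=False)

    def forward(self, patch_tokens):              # (B, N, D) ViT patch embeddings
        tokens = F.normalize(patch_tokens, dim=-1)
        parts = F.normalize(self.parts, dim=-1)   # (P, K, D)
        # Similarity of every part to every patch: (B, P, K, N)
        sim = torch.einsum("bnd,pkd->bpkn", tokens, parts)
        part_act = sim.max(dim=-1).values         # best patch per part: (B, P, K)
        proto_act = part_act.sum(dim=-1)          # prototype activation: (B, P)
        return self.classifier(proto_act), proto_act

# Hypothetical sizes; the ViT backbone supplying patch_tokens is not shown.
head = PrototypePartHead(dim=768, n_prototypes=200, n_parts=4, n_classes=200)
logits, activations = head(torch.randn(2, 196, 768))
```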
https://arxiv.org/abs/2410.20722
Test-time prompt tuning, which learns prompts online with unlabelled test samples during the inference stage, has demonstrated great potential by learning effective prompts on-the-fly without requiring any task-specific annotations. However, its performance often degrades clearly along the tuning process when the prompts are continuously updated with the test data flow, and the degradation becomes more severe when the domain of test samples changes continuously. We propose HisTPT, a Historical Test-time Prompt Tuning technique that memorizes the useful knowledge of the learnt test samples and enables robust test-time prompt tuning with the memorized knowledge. HisTPT introduces three types of knowledge banks, namely, local knowledge bank, hard-sample knowledge bank, and global knowledge bank, each of which works with different mechanisms for effective knowledge memorization and test-time prompt optimization. In addition, HisTPT features an adaptive knowledge retrieval mechanism that regularizes the prediction of each test sample by adaptively retrieving the memorized knowledge. Extensive experiments show that HisTPT achieves superior prompt tuning performance consistently while handling different visual recognition tasks (e.g., image classification, semantic segmentation, and object detection) and test samples from continuously changing domains.
https://arxiv.org/abs/2410.20346
This study introduces SLLMBO, an innovative framework that leverages Large Language Models (LLMs) for hyperparameter optimization (HPO), incorporating dynamic search space adaptability, enhanced parameter landscape exploitation, and a hybrid, novel LLM-Tree-structured Parzen Estimator (LLM-TPE) sampler. By addressing limitations in recent fully LLM-based methods and traditional Bayesian Optimization (BO), SLLMBO achieves more robust optimization. This comprehensive benchmarking evaluates multiple LLMs, including GPT-3.5-turbo, GPT-4o, Claude-Sonnet-3.5, and Gemini-1.5-flash, extending prior work beyond GPT-3.5 and GPT-4 and establishing SLLMBO as the first framework to benchmark a diverse set of LLMs for HPO. By integrating LLMs' established strengths in parameter initialization with the exploitation abilities demonstrated in this study, alongside TPE's exploration capabilities, the LLM-TPE sampler achieves a balanced exploration-exploitation trade-off, reduces API costs, and mitigates premature early stoppings for more effective parameter searches. Across 14 tabular tasks in classification and regression, the LLM-TPE sampler outperformed fully LLM-based methods and achieved superior results over BO methods in 9 tasks. Testing early stopping in budget-constrained scenarios further demonstrated competitive performance, indicating that LLM-based methods generally benefit from extended iterations for optimal results. This work lays the foundation for future research exploring open-source LLMs, reproducibility of LLM results in HPO, and benchmarking SLLMBO on complex datasets, such as image classification, segmentation, and machine translation.
https://arxiv.org/abs/2410.20302
Optimization is critical for optimal performance in deep neural networks (DNNs). Traditional gradient-based methods often face challenges like local minima entrapment. This paper explores population-based metaheuristic optimization algorithms for image classification networks. We propose a novel approach integrating a two-stage training technique with population-based optimization algorithms incorporating local search capabilities. Our experiments demonstrate that the proposed method outperforms state-of-the-art gradient-based techniques, such as ADAM, in accuracy and computational efficiency, particularly with high computational complexity and numerous trainable parameters. The results suggest that our approach offers a robust alternative to traditional methods for weight optimization in convolutional neural networks (CNNs). Future work will explore integrating adaptive mechanisms for parameter tuning and applying the proposed method to other types of neural networks and real-time applications.
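As a toy illustration of population-based weight optimization with local search, the sketch below evolves the weights of a small linear classifier on synthetic data, refining the elites with a hill-climbing step each generation; the fitness function, population sizes, and mutation scales are placeholders, and the paper's two-stage CNN training scheme is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for an image-classification batch (flattened 8x8 "images", 3 classes).
X = rng.normal(size=(200, 64))
y = rng.integers(0, 3, size=200)

def accuracy(w):
    """Fitness: accuracy of a linear classifier with flattened weights w on the toy batch."""
    W = w.reshape(64, 3)
    return float(((X @ W).argmax(axis=1) == y).mean())

def local_search(w, steps=5, sigma=0.02):
    """Simple hill-climbing refinement around a candidate (the 'local search' component)."""
    best, best_fit = w, accuracy(w)
    for _ in range(steps):
        cand = best + rng.normal(scale=sigma, size=best.shape)
        fit = accuracy(cand)
        if fit > best_fit:
            best, best_fit = cand, fit
    return best

# Population-based search: keep elites, refine them locally, mutate to produce children.
pop = rng.normal(scale=0.1, size=(20, 64 * 3))
for gen in range(30):
    fitness = np.array([accuracy(w) for w in pop])
    elites = pop[np.argsort(-fitness)[:5]]
    elites = np.array([local_search(w) for w in elites])
    children = elites[rng.integers(0, 5, size=15)] + rng.normal(scale=0.05, size=(15, 64 * 3))
    pop = np.vstack([elites, children])

print("best accuracy:", max(accuracy(w) for w in pop))
```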
https://arxiv.org/abs/2410.20234
While the pretraining of Foundation Models (FMs) for remote sensing (RS) imagery is on the rise, models remain restricted to a few hundred million parameters. Scaling models to billions of parameters has been shown to yield unprecedented benefits, including emergent abilities, but requires data scaling and computing resources typically not available outside industry R&D labs. In this work, we pair high-performance computing resources, including the Frontier supercomputer, America's first exascale system, with high-resolution optical RS data to pretrain billion-scale FMs. Our study assesses the performance of different pretrained variants of vision Transformers across image classification, semantic segmentation, and object detection benchmarks, highlighting the importance of data scaling for effective model scaling. Moreover, we discuss the construction of a novel TIU pretraining dataset and model initialization, with the data and pretrained models intended for public release. By discussing technical challenges and details often lacking in the related literature, this work is intended to offer best practices to the geospatial community toward efficient training and benchmarking of larger FMs.
https://arxiv.org/abs/2410.19965
This paper presents an advanced approach for fine-tuning BiomedCLIP PubMedBERT, a multimodal model, to classify abnormalities in Video Capsule Endoscopy (VCE) frames, aiming to enhance diagnostic efficiency in gastrointestinal healthcare. By integrating the PubMedBERT language model with a Vision Transformer (ViT) to process endoscopic images, our method categorizes images into ten specific classes: angioectasia, bleeding, erosion, erythema, foreign body, lymphangiectasia, polyp, ulcer, worms, and normal. Our workflow incorporates image preprocessing and fine-tunes the BiomedCLIP model to generate high-quality embeddings for both visual and textual inputs, aligning them through similarity scoring for classification. Performance metrics, including classification accuracy, recall, and F1 score, indicate the model's strong ability to accurately identify abnormalities in endoscopic frames, showing promise for practical use in clinical diagnostics.
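The similarity-scoring classification step amounts to comparing an image embedding against one text embedding per class prompt; a minimal sketch is shown below, with random tensors standing in for real BiomedCLIP embeddings and an assumed prompt format.

```python
import torch
import torch.nn.functional as F

CLASSES = ["angioectasia", "bleeding", "erosion", "erythema", "foreign body",
           "lymphangiectasia", "polyp", "ulcer", "worms", "normal"]

def classify_by_similarity(image_embedding, text_embeddings, temperature=0.07):
    """Assign the class whose text embedding is most similar to the image embedding.

    image_embedding: (D,) from the (fine-tuned) vision tower.
    text_embeddings: (10, D) from the text tower, one per class prompt,
                     e.g. "an endoscopy image showing {class}" (prompt wording is a guess).
    """
    img = F.normalize(image_embedding, dim=-1)
    txt = F.normalize(text_embeddings, dim=-1)
    logits = img @ txt.t() / temperature          # cosine similarity scores
    probs = logits.softmax(dim=-1)
    return CLASSES[int(probs.argmax())], probs

# Random tensors stand in for real BiomedCLIP embeddings.
label, probs = classify_by_similarity(torch.randn(512), torch.randn(10, 512))
print(label)
```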
https://arxiv.org/abs/2410.19944
The escalating significance of information security has underscored the pervasive role of encryption technology in safeguarding communication content. Morse code, a well-established and effective encryption method, has found widespread application in telegraph communication and various domains. However, the transmission of Morse code images faces challenges due to diverse noises and distortions, thereby hindering comprehensive classification outcomes. Existing methodologies predominantly concentrate on categorizing Morse code images affected by a single type of noise, neglecting the multitude of scenarios that noise pollution can generate. To overcome this limitation, we propose a novel two-stage approach, termed the Noise Adaptation Network (NANet), for Morse code image classification. Our method involves exclusive training on pristine images while adapting to noisy ones through the extraction of critical information unaffected by noise. In the initial stage, we introduce a U-shaped network structure designed to learn representative features and denoise images. Subsequently, the second stage employs a deep convolutional neural network for classification. By leveraging the denoising module from the first stage, our approach achieves enhanced accuracy and robustness in the subsequent classification phase. We conducted an evaluation of our approach on a diverse dataset, encompassing Gaussian, salt-and-pepper, and uniform noise variations. The results convincingly demonstrate the superiority of our methodology over existing approaches. The datasets are available on this https URL
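The two-stage design can be pictured as a small U-shaped denoiser feeding a conventional CNN classifier; the PyTorch sketch below shows such a pipeline at toy scale (layer sizes, the 64x64 input, and the class count are assumptions, not the NANet configuration).

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A very small U-shaped denoiser: one down step, one up step, one skip connection."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Conv2d(32, 1, 3, padding=1)   # 32 = 16 (skip) + 16 (upsampled)

    def forward(self, x):
        e = self.enc(x)
        d = self.up(self.down(e))
        return self.dec(torch.cat([e, d], dim=1))

class Classifier(nn.Module):
    """Stage two: a small CNN that classifies the denoised Morse code image."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

# Stage 1: train TinyUNet on pristine images with a reconstruction/denoising loss;
# Stage 2: reuse the denoiser and classify denoiser(x) with the CNN.
denoiser, clf = TinyUNet(), Classifier(n_classes=36)   # 36 classes is an assumption
logits = clf(denoiser(torch.randn(4, 1, 64, 64)))
```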
https://arxiv.org/abs/2410.19180
Several medical Multimodal Large Language Models (MLLMs) have been developed to address tasks involving visual images with textual instructions across various medical modalities, achieving impressive results. Most current medical generalist models are region-agnostic, treating the entire image as a holistic representation. However, they struggle to identify which specific regions they are focusing on when generating their outputs. To mimic the behavior of doctors, who typically begin by reviewing the entire image before concentrating on specific regions for a thorough evaluation, we aim to enhance the capability of medical MLLMs in understanding anatomical regions within entire medical scans. To achieve this, we first formulate Region-Centric tasks and construct a large-scale dataset, MedRegInstruct, to incorporate regional information into training. Combining our collected dataset with other medical multimodal corpora for training, we propose a Region-Aware medical MLLM, MedRegA, which is the first bilingual generalist medical AI system to simultaneously handle image-level and region-level medical vision-language tasks across a broad range of modalities. Our MedRegA not only enables three region-centric tasks, but also achieves the best performance for visual question answering, report generation, and medical image classification over 8 modalities, showcasing significant versatility. Experiments demonstrate that our model can not only accomplish powerful performance across various medical vision-language tasks in bilingual settings, but also recognize and detect structures in multimodal medical scans, boosting the interpretability and user interactivity of medical MLLMs. Our project page is this https URL.
https://arxiv.org/abs/2410.18387
Large pre-trained models have achieved notable success across a range of downstream tasks. However, recent research shows that a type of adversarial attack ($\textit{i.e.,}$ backdoor attack) can manipulate the behavior of machine learning models by contaminating their training dataset, posing a significant threat to the real-world application of large pre-trained models, especially customized models. Therefore, addressing the unique challenges of exploring the vulnerability of pre-trained models is of paramount importance. Through empirical studies of the capability for performing backdoor attacks on large pre-trained models ($\textit{e.g.,}$ ViT), we find the following unique challenges of attacking large pre-trained models: 1) the inability to manipulate or even access large training datasets, and 2) the substantial computational resources required for training or fine-tuning these models. To address these challenges, we establish new standards for an effective and feasible backdoor attack in the context of large pre-trained models. In line with these standards, we introduce our EDT model, an \textbf{E}fficient, \textbf{D}ata-free, \textbf{T}raining-free backdoor attack method. Inspired by model editing techniques, EDT injects an editing-based lightweight codebook into the backdoor of large pre-trained models, which replaces the embedding of the poisoned image with the target image without poisoning the training dataset or training the victim model. Our experiments, conducted across various pre-trained models such as ViT, CLIP, BLIP, and Stable Diffusion, and on downstream tasks including image classification, image captioning, and image generation, demonstrate the effectiveness of our method. Our code is available in the supplementary material.
https://arxiv.org/abs/2410.18267
A solar active region can significantly disrupt the Sun-Earth space environment, often leading to severe space weather events such as solar flares and coronal mass ejections. As a consequence, the automatic classification of active region groups is the crucial starting point for accurately and promptly predicting solar activity. This study presents our results concerning the application of deep learning techniques to the classification of active region cutouts based on the Mount Wilson classification scheme. Specifically, we have explored the latest advancements in image classification architectures, from Convolutional Neural Networks to Vision Transformers, and report on their performance for the active region classification task, showing that the crucial point for their effectiveness is a robust training process based on the latest advances in the field.
https://arxiv.org/abs/2410.17816
The Pap smear is a screening method for early cervical cancer diagnosis. The selection of the right optimizer in a convolutional neural network (CNN) model is key to the success of the CNN in image classification, including the classification of cervical cancer Pap smear images. In this study, the stochastic gradient descent (SGD), RMSprop, Adam, AdaGrad, AdaDelta, Adamax, and Nadam optimizers were used to classify cervical cancer Pap smear images from the SipakMed dataset. Resnet-18, Resnet-34, and VGG-16 are the CNN architectures used in this study, and each architecture uses a transfer-learning model. Based on the test results, we conclude that the transfer learning model performs better across all CNNs and optimization techniques, and that in the transfer learning setting, the choice of optimizer has little influence on the training of the model. Adamax had the best accuracy for the VGG-16 and Resnet-18 architectures, with accuracy values of 72.8% and 66.8%, respectively; for Resnet-34 it reached 54.0%, which is 0.034% lower than Nadam. Overall, Adamax is a suitable optimizer for CNNs in cervical cancer classification on the Resnet-18, Resnet-34, and VGG-16 architectures. This study provides new insights into the configuration of CNN models for Pap smear image analysis.
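The comparison described above boils down to swapping the optimizer in an otherwise fixed transfer-learning setup; a torchvision sketch is shown below, where the learning rates are placeholders rather than the paper's settings and the five-class output assumes the SipakMed cell categories.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_model(arch="resnet18", n_classes=5):
    """Transfer-learning setup: ImageNet-pretrained backbone with a new classification head."""
    if arch == "resnet18":
        model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        model.fc = nn.Linear(model.fc.in_features, n_classes)
    elif arch == "vgg16":
        model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        model.classifier[6] = nn.Linear(model.classifier[6].in_features, n_classes)
    return model

# Learning rates below are illustrative defaults, not the values used in the study.
OPTIMIZERS = {
    "sgd":     lambda p: torch.optim.SGD(p, lr=1e-3, momentum=0.9),
    "adam":    lambda p: torch.optim.Adam(p, lr=1e-4),
    "adamax":  lambda p: torch.optim.Adamax(p, lr=1e-4),
    "nadam":   lambda p: torch.optim.NAdam(p, lr=1e-4),
    "rmsprop": lambda p: torch.optim.RMSprop(p, lr=1e-4),
}

model = build_model("resnet18", n_classes=5)          # assuming 5 SipakMed cell categories
optimizer = OPTIMIZERS["adamax"](model.parameters())
criterion = nn.CrossEntropyLoss()
# ... standard training loop: forward pass, criterion(logits, labels), backward, optimizer.step()
```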
https://arxiv.org/abs/2410.17735