In the realm of skin lesion image classification, the intricate spatial and semantic features pose significant challenges for conventional Convolutional Neural Network (CNN)-based methodologies. These challenges are compounded by the imbalanced nature of skin lesion datasets, which hampers the ability of models to learn minority class features effectively. Despite augmentation strategies, such as those using Generative Adversarial Networks (GANs), previous attempts have not fully addressed these complexities. This study introduces an innovative approach by integrating Graph Neural Networks (GNNs) with Capsule Networks to enhance classification performance. GNNs, known for their proficiency in handling graph-structured data, offer an advanced mechanism for capturing complex patterns and relationships beyond the capabilities of traditional CNNs. Capsule Networks further contribute by providing superior recognition of spatial hierarchies within images. Our research focuses on evaluating and enhancing the Tiny Pyramid Vision GNN (Tiny Pyramid ViG) architecture by incorporating it with a Capsule Network. This hybrid model was applied to the MNIST:HAM10000 dataset, a comprehensive skin lesion dataset designed for benchmarking classification models. After 75 epochs of training, our model achieved a significant accuracy improvement, reaching 89.23% and 95.52%, surpassing established benchmarks such as GoogLeNet (83.94%), InceptionV3 (86.82%), MobileNet V3 (89.87%), EfficientNet-7B (92.07%), ResNet18 (92.22%), ResNet34 (91.90%), ViT-Base (73.70%), and IRv2-SA (93.47%) on the same dataset. This outcome underscores the potential of our approach in overcoming the inherent challenges of skin lesion classification, contributing to the advancement of image-based diagnosis in dermatology.
https://arxiv.org/abs/2403.12009
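The capsule half of this hybrid rests on the "squash" nonlinearity, which rescales a capsule's output vector so that its length lies in (0, 1) and can act as a class probability while its direction encodes pose. A minimal numpy sketch of the standard formulation (an illustration, not the authors' exact implementation):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Scale the vector so its length lands in (0, 1) while its
    # direction is preserved: v = (|s|^2 / (1 + |s|^2)) * s / |s|.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm) / np.sqrt(sq_norm + eps)
    return scale * s

v = squash(np.array([3.0, 4.0]))   # input length 5
# output length is 25/26, i.e. just under 1
```

Lengths saturate toward 1 for strongly activated capsules and toward 0 for inactive ones, which is what lets capsule layers treat vector length as a detection probability.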
In recent times, the fields of high-energy physics (HEP) experimentation and phenomenological studies have seen the integration of machine learning (ML) and its specialized branch, deep learning (DL). This survey offers a comprehensive assessment of these applications across various DL approaches. The initial segment of the paper introduces the fundamentals of the diverse particle physics types and establishes criteria for evaluating particle physics in tandem with learning models. Following this, a comprehensive taxonomy is presented for representing HEP images, encompassing accessible datasets, the intricate details of preprocessing techniques, and methods of feature extraction and selection. Subsequently, the focus shifts to an exploration of available artificial intelligence (AI) models tailored to HEP images, along with a concentrated examination of HEP image classification pertaining to Jet particles. Within this review, a profound investigation is undertaken into distinct proposed state-of-the-art (SOTA) ML and DL techniques, underscoring their implications for HEP inquiries. The discussion delves into specific applications in substantial detail, including Jet tagging, Jet tracking, particle classification, and more. The survey culminates with an analysis of the present status of HEP grounded in DL methodologies, encompassing inherent challenges and prospective avenues for future research endeavors.
https://arxiv.org/abs/2403.11934
Despite the availability of large datasets for tasks like image classification and image-text alignment, labeled data for more complex recognition tasks, such as detection and segmentation, is less abundant. In particular, for instance segmentation, annotations are time-consuming to produce, and the distribution of instances is often highly skewed across classes. While semi-supervised teacher-student distillation methods show promise in leveraging vast amounts of unlabeled data, they suffer from miscalibration, resulting in overconfidence in frequently represented classes and underconfidence in rarer ones. Additionally, these methods encounter difficulties in efficiently learning from a limited set of examples. First, we introduce a dual strategy to enhance the teacher model's training process, substantially improving performance on few-shot learning. Second, we propose a calibration correction mechanism that enables the student model to correct the teacher's calibration errors. Using our approach, we observed marked improvements over a state-of-the-art supervised baseline on the LVIS dataset, with an increase of 2.8% in average precision (AP) and a 10.3% gain in AP for rare classes.
https://arxiv.org/abs/2403.11675
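The abstract does not spell out the calibration correction mechanism the student learns; as a hedged illustration of the miscalibration problem it targets, temperature scaling is the textbook way to soften overconfident predictions (frequent classes) or sharpen underconfident ones (rare classes):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def recalibrate(logits, T):
    # T > 1 softens overconfident predictions; T < 1 sharpens
    # underconfident ones. The predicted class is unchanged.
    return softmax(logits / T)

logits = np.array([4.0, 1.0, 0.0])    # overconfident teacher output
p_raw = softmax(logits)
p_cal = recalibrate(logits, T=2.0)    # calibrated: max prob drops
```

A per-class or learned correction (as the student model applies here) generalizes this single scalar T.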
Due to privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming increasingly evident. In real-world scenarios, erasure requests originate at any time from both users and model owners. These requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while the rest is maintained. We define this problem as continual forgetting and identify two key challenges. (i) For unwanted knowledge, efficient and effective deletion is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. To address them, we propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we use LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. GS-LoRA is effective, parameter-efficient, data-efficient, and easy to implement. We conduct extensive experiments on face recognition, object detection, and image classification and demonstrate that GS-LoRA manages to forget specific classes with minimal impact on other classes. Code will be released on \url{this https URL}.
https://arxiv.org/abs/2403.11530
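The group sparse regularization behind GS-LoRA can be sketched as a group-lasso penalty, one L2 term per LoRA group: whole groups the forgetting task does not need are driven to exactly zero, which is what enables automatic group selection. A toy numpy sketch (grouping the flattened per-layer A/B matrices this way is an assumption of the sketch):

```python
import numpy as np

def group_sparse_penalty(lora_groups, alpha=0.01):
    # Group lasso: sum of L2 norms, one term per LoRA group.
    # Unlike plain L1, it zeroes entire groups rather than
    # individual weights, so unneeded LoRA modules vanish whole.
    return alpha * sum(np.linalg.norm(g) for g in lora_groups)

active = np.ones(8)      # hypothetical flattened LoRA A/B matrices
pruned = np.zeros(8)     # a group the regularizer has zeroed out
penalty = group_sparse_penalty([active, pruned])
```

In training this penalty is added to the forgetting loss; the zeroed groups contribute nothing, so they can be dropped from the adapted model.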
Visual State Space Model (VMamba) has recently emerged as a promising architecture, exhibiting remarkable performance in various computer vision tasks. However, its robustness has not yet been thoroughly studied. In this paper, we delve into the robustness of this architecture through comprehensive investigations from multiple perspectives. Firstly, we investigate its robustness to adversarial attacks, employing both whole-image and patch-specific adversarial attacks. Results demonstrate superior adversarial robustness compared to Transformer architectures while revealing scalability weaknesses. Secondly, the general robustness of VMamba is assessed against diverse scenarios, including natural adversarial examples, out-of-distribution data, and common corruptions. VMamba exhibits exceptional generalizability with out-of-distribution data but shows scalability weaknesses against natural adversarial examples and common corruptions. Additionally, we explore VMamba's gradients and back-propagation during white-box attacks, uncovering unique vulnerabilities and defensive capabilities of its novel components. Lastly, the sensitivity of VMamba to image structure variations is examined, highlighting vulnerabilities associated with the distribution of disturbance areas and spatial information, with increased susceptibility closer to the image center. Through these comprehensive studies, we contribute to a deeper understanding of VMamba's robustness, providing valuable insights for refining and advancing the capabilities of deep neural networks in computer vision applications.
https://arxiv.org/abs/2403.10935
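As a concrete example of the whole-image adversarial attacks such a robustness study employs, here is a minimal numpy sketch of the Fast Gradient Sign Method; the abstract does not name the specific attacks used, so this is illustrative only (patch-specific attacks restrict the perturbed region instead):

```python
import numpy as np

def fgsm(x, grad_x, eps=0.03):
    # FGSM: perturb every pixel by +/- eps in the direction that
    # increases the loss, then clip back to the valid pixel range.
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

x = np.array([0.5, 0.2, 0.99])
g = np.array([1.0, -2.0, 0.5])    # dLoss/dx obtained via backprop
x_adv = fgsm(x, g)                # -> [0.53, 0.17, 1.0]
```

The perturbation is imperceptible for small eps yet often flips the prediction, which is what the adversarial-robustness comparisons measure.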
The proliferation of digital images and the advancements in deep learning have paved the way for innovative solutions in various domains, especially in the field of image classification. Our project presents an in-depth study and implementation of an image classification system specifically tailored to identify and classify images of Indian cities. Drawing from an extensive dataset, our model classifies images into five major Indian cities/states: Ahmedabad, Delhi, Kerala, Kolkata, and Mumbai, recognizing the distinct features and characteristics of each. To achieve high precision and recall rates, we adopted two approaches: first, a vanilla Convolutional Neural Network (CNN), and second, transfer learning leveraging the VGG16 model. The vanilla CNN achieved commendable accuracy, and the VGG16 model achieved a test accuracy of 63.6%. Evaluations highlighted the strengths and potential areas of improvement, positioning our model as not only competitive but also scalable for broader applications. With an emphasis on an open-source ethos, our work aims to contribute to the community, encouraging further development and diverse applications. Our findings demonstrate potential applications in tourism, urban planning, and even real-time location identification systems, among others.
https://arxiv.org/abs/2403.10912
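The transfer-learning recipe described (a frozen pretrained base plus a small trainable head) can be sketched framework-free; the random "frozen backbone" below is only a stand-in for VGG16's convolutional base, and the data and labels are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pretrained backbone (e.g. VGG16's
# convolutional base with weights fixed): an untrained feature map.
W_frozen = rng.normal(size=(8, 16))
def frozen_features(x):
    return np.maximum(x @ W_frozen, 0.0)   # frozen ReLU features

X = rng.normal(size=(200, 8))
F = frozen_features(X)
y = np.sign(F @ rng.normal(size=16))       # toy separable labels

# Transfer learning = fit only a lightweight head on the frozen
# features (a least-squares linear classifier stands in for the
# usual softmax layer here).
w, *_ = np.linalg.lstsq(F, y, rcond=None)
train_acc = float((np.sign(F @ w) == y).mean())
```

Only `w` is trained; the backbone never sees a gradient, which is what keeps fine-tuning cheap on a modest dataset.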
Cytology image segmentation is quite challenging due to its complex cellular structure and multiple overlapping regions. On the other hand, supervised machine learning techniques need a large amount of annotated data, which is costly. In recent years, late fusion techniques have shown promising performance in the field of image classification. In this paper, we explore a fuzzy-based late fusion technique for cytology image segmentation. This fusion rule integrates three traditional semantic segmentation models: UNet, SegNet, and PSPNet. The technique is applied to two cytology image datasets, i.e., the cervical cytology (HErlev) and breast cytology (JUCYT-v1) image datasets. With the proposed late fusion technique, we achieved maximum MeanIoU scores of 84.27% and 83.79% on the HErlev and JUCYT-v1 datasets, respectively, which are better than those of traditional fusion rules such as average probability, geometric mean, and Borda count. The code of the proposed model is available on GitHub.
https://arxiv.org/abs/2403.10884
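The paper compares its fuzzy rule against traditional late-fusion baselines such as average probability and geometric mean; a numpy sketch of those baselines over per-pixel class probabilities (the fuzzy rule itself is not specified in the abstract, so it is omitted):

```python
import numpy as np

def late_fuse(prob_maps, rule="average"):
    # Fuse per-pixel class probabilities from several segmentation
    # models (e.g. UNet, SegNet, PSPNet outputs stacked on axis 0).
    p = np.stack(prob_maps)                  # (models, H, W, classes)
    if rule == "average":
        fused = p.mean(axis=0)
    elif rule == "geometric":
        fused = np.exp(np.log(p + 1e-12).mean(axis=0))
    else:
        raise ValueError(rule)
    return fused / fused.sum(axis=-1, keepdims=True)

# Three toy 1x1-pixel, 2-class predictions
m1 = np.array([[[0.9, 0.1]]])
m2 = np.array([[[0.6, 0.4]]])
m3 = np.array([[[0.7, 0.3]]])
fused = late_fuse([m1, m2, m3], rule="geometric")
```

The geometric mean punishes disagreement more harshly than the arithmetic mean, since one near-zero vote drags the fused probability down.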
Histopathological whole slide image (WSI) analysis with deep learning has become a research focus in computational pathology. The current paradigm is mainly based on multiple instance learning (MIL), in which approaches with Transformer as the backbone are well discussed. These methods convert WSI tasks into sequence tasks by representing patches as tokens in the WSI sequence. However, the feature complexity brought by high heterogeneity and the ultra-long sequences brought by gigapixel size make Transformer-based MIL suffer from high memory consumption, slow inference speed, and subpar performance. To this end, we propose a retentive MIL method called RetMIL, which processes WSI sequences through a hierarchical feature propagation structure. At the local level, the WSI sequence is divided into multiple subsequences. Tokens of each subsequence are updated through a parallel linear retention mechanism and aggregated utilizing an attention layer. At the global level, subsequences are fused into a global sequence, then updated through a serial retention mechanism, and finally the slide-level representation is obtained through global attention pooling. We conduct experiments on the two public CAMELYON and BRACS datasets and a public-internal LUNG dataset, confirming that RetMIL not only achieves state-of-the-art performance but also significantly reduces computational overhead. Our code will be released shortly.
https://arxiv.org/abs/2403.10858
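The linear retention mechanism RetMIL builds on can be sketched in its parallel form: attention-like scores Q·Kᵀ, but with a causal exponential-decay mask instead of a softmax (per-head normalization and multi-head details are omitted in this sketch):

```python
import numpy as np

def retention(Q, K, V, gamma=0.9):
    # Parallel form of single-head linear retention: the decay mask
    # D[t, s] = gamma**(t - s) for s <= t (0 otherwise) replaces the
    # softmax, so the whole update is linear in the values V.
    T = Q.shape[0]
    idx = np.arange(T)
    D = np.tril(gamma ** (idx[:, None] - idx[None, :]).astype(float))
    return (Q @ K.T * D) @ V

Q = K = np.ones((3, 1))
V = np.array([[1.0], [0.0], [0.0]])
out = retention(Q, K, V)   # -> [[1.0], [0.9], [0.81]]
```

Because the mask is a fixed decay rather than a data-dependent softmax, the same computation admits a recurrent (serial) form, which is how retention keeps memory and inference costs low on ultra-long WSI sequences.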
Semi-supervised learning (SSL) seeks to enhance task performance by training on both labeled and unlabeled data. Mainstream SSL image classification methods mostly optimize a loss that additively combines a supervised classification objective with a regularization term derived solely from unlabeled data. This formulation neglects the potential for interaction between labeled and unlabeled images. In this paper, we introduce InterLUDE, a new approach to enhance SSL made of two parts that each benefit from labeled-unlabeled interaction. The first part, embedding fusion, interpolates between labeled and unlabeled embeddings to improve representation learning. The second part is a new loss, grounded in the principle of consistency regularization, that aims to minimize discrepancies in the model's predictions between labeled versus unlabeled inputs. Experiments on standard closed-set SSL benchmarks and a medical SSL task with an uncurated unlabeled set show clear benefits to our approach. On the STL-10 dataset with only 40 labels, InterLUDE achieves 3.2% error rate, while the best previous method reports 14.9%.
https://arxiv.org/abs/2403.10658
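The embedding-fusion part can be sketched as a mixup-style interpolation between labeled and unlabeled embeddings; the exact fusion InterLUDE uses may differ, so treat this as an illustration of the labeled-unlabeled interaction it introduces:

```python
import numpy as np

def embedding_fusion(z_lab, z_unlab, lam=0.5):
    # Interpolate a labeled embedding with an unlabeled one so the
    # representation is shaped by both data sources at once.
    return lam * z_lab + (1.0 - lam) * z_unlab

z_l = np.array([1.0, 0.0])
z_u = np.array([0.0, 1.0])
z_mix = embedding_fusion(z_l, z_u, lam=0.25)   # -> [0.25, 0.75]
```

In a standard additive SSL loss the two streams never touch; an interpolation like this is the simplest way to make them interact.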
Real-world vision models in dynamic environments face rapid shifts in domain distributions, leading to decreased recognition performance. Continual test-time adaptation (CTTA) directly adjusts a pre-trained source discriminative model to these changing domains using test data. A highly effective CTTA method involves applying layer-wise adaptive learning rates and selectively adapting pre-trained layers. However, it suffers from poor estimation of domain shift and from inaccuracies arising from pseudo-labels. In this work, we aim to overcome these limitations by identifying layers through the quantification of model prediction uncertainty, without relying on pseudo-labels. We utilize the magnitude of gradients as a metric, calculated by backpropagating the KL divergence between the softmax output and a uniform distribution, to select layers for further adaptation. Subsequently, for the parameters exclusively belonging to these selected layers, with the remaining ones frozen, we evaluate their sensitivity in order to approximate the domain shift, and then adjust their learning rates accordingly. Overall, this approach leads to more robust and stable optimization than prior approaches. We conduct extensive image classification experiments on CIFAR-10C, CIFAR-100C, and ImageNet-C and demonstrate the efficacy of our method against standard benchmarks and prior methods.
https://arxiv.org/abs/2403.10650
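The layer-selection signal (KL divergence between the softmax output and a uniform distribution, backpropagated to obtain gradient magnitudes) can be sketched with finite differences standing in for autograd; the paper backpropagates through each layer instead:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_to_uniform(z):
    # KL(softmax(z) || uniform): large when the model is confident,
    # exactly zero at the maximally uncertain (uniform) prediction.
    p = softmax(z)
    return float(np.sum(p * np.log(p * len(z))))

def grad_magnitude(z, h=1e-5):
    # Finite-difference stand-in for backpropagating the KL loss;
    # the resulting gradient norm is the layer-selection metric.
    g = np.array([(kl_to_uniform(z + h * e) - kl_to_uniform(z - h * e))
                  / (2 * h) for e in np.eye(len(z))])
    return float(np.linalg.norm(g))

confident = np.array([5.0, 0.0, 0.0])
uncertain = np.array([0.0, 0.0, 0.0])
```

Note the metric needs no labels at all, which is the point: it sidesteps pseudo-label noise entirely.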
Training a linear classifier or lightweight model on top of pretrained vision model outputs, so-called 'frozen features', leads to impressive performance on a number of downstream few-shot tasks. Currently, frozen features are not modified during training. On the other hand, when networks are trained directly on images, data augmentation is a standard recipe that improves performance with no substantial overhead. In this paper, we conduct an extensive pilot study on few-shot image classification that explores applying data augmentations in the frozen feature space, dubbed 'frozen feature augmentation (FroFA)', covering twenty augmentations in total. Our study demonstrates that adopting a deceptively simple pointwise FroFA, such as brightness, can improve few-shot performance consistently across three network architectures, three large pretraining datasets, and eight transfer datasets.
https://arxiv.org/abs/2403.10519
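A pointwise FroFA such as "brightness" amounts to adding a scalar offset to the frozen features and clipping, exactly as one would on pixels; the [0, 1] feature range assumed below is illustrative, not the paper's specification:

```python
import numpy as np

def brightness_frofa(feats, delta):
    # "Brightness" applied to frozen features rather than pixels:
    # a scalar offset followed by a clip to the (assumed) valid range.
    return np.clip(feats + delta, 0.0, 1.0)

f = np.array([0.2, 0.5, 0.95])
f_aug = brightness_frofa(f, 0.1)   # -> [0.3, 0.6, 1.0]
```

The appeal is cost: the expensive backbone forward pass is done once, and augmentation happens on the cached features at each training step.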
How well the heart is functioning can be quantified through measurements of myocardial deformation via echocardiography. Clinical assessment of cardiac function is generally focused on global indices of relative shortening; however, territorial and segmental strain indices have been shown to be abnormal in regions of myocardial disease, such as scar. In this work, we propose a single framework to predict myocardial disease substrates at global, territorial, and segmental levels, using regional myocardial strain traces as input to a convolutional neural network (CNN)-based classification algorithm. We propose an anatomically meaningful representation of the input data, mapping the clinically standard bullseye representation to a multi-channel 2D image, to formulate the task as an image classification problem and thus enable the use of state-of-the-art neural network configurations. A Fully Convolutional Network (FCN) is trained to detect and localize myocardial scar from regional left ventricular (LV) strain patterns. Simulated regional strain data from a controlled dataset of virtual patients with varying degrees and locations of myocardial scar is used for training and validation. The proposed method successfully detects and localizes the scars on 98% of the 5490 LV segments of the 305 patients in the test set using strain traces only. Due to the sparse occurrence of scar, only 10% of the LV segments in the virtual patient cohort have scar. Taking this imbalance into account, the class-balanced accuracy is 95%. Performance is reported at the global, territorial, and segmental levels. The proposed method proves successful on the strain traces of the virtual cohort and offers the potential to solve the regional myocardial scar detection problem on strain traces of real patient cohorts.
https://arxiv.org/abs/2403.10291
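The proposed rearrangement of bullseye-ordered strain traces into a multi-channel 2D image can be sketched as follows; the ring grouping follows the 17-segment AHA model, while zero-padding the shorter rings is an assumption of this sketch rather than the paper's stated choice:

```python
import numpy as np

def traces_to_image(traces, rings=(6, 6, 4, 1)):
    # Rearrange AHA bullseye-ordered strain traces (segments x time)
    # into a (rings x max_segments x time) array so that anatomical
    # neighbours become spatial neighbours for the CNN.
    n_time = traces.shape[1]
    img = np.zeros((len(rings), max(rings), n_time))
    start = 0
    for r, n in enumerate(rings):
        img[r, :n, :] = traces[start:start + n]
        start += n
    return img

traces = np.random.default_rng(0).normal(size=(17, 50))  # 17 segments
img = traces_to_image(traces)   # shape (4, 6, 50)
```

Once the traces live on a regular grid, any off-the-shelf image classifier can be applied without architectural changes.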
The task of few-shot image classification and segmentation (FS-CS) involves classifying and segmenting target objects in a query image, given only a few examples of the target classes. We introduce the Vision-Instructed Segmentation and Evaluation (VISE) method that transforms the FS-CS problem into the Visual Question Answering (VQA) problem, utilising Vision-Language Models (VLMs), and addresses it in a training-free manner. By enabling a VLM to interact with off-the-shelf vision models as tools, the proposed method is capable of classifying and segmenting target objects using only image-level labels. Specifically, chain-of-thought prompting and in-context learning guide the VLM to answer multiple-choice questions like a human; vision models such as YOLO and Segment Anything Model (SAM) assist the VLM in completing the task. The modular framework of the proposed method makes it easily extendable. Our approach achieves state-of-the-art performance on the Pascal-5i and COCO-20i datasets.
https://arxiv.org/abs/2403.10287
In goal-oriented communications, the objective of the receiver is often to apply a deep-learning model rather than to reconstruct the original data. In this context, direct learning over compressed data, without any prior decoding, holds promise for enhancing the time-efficient execution of inference models at the receiver. However, conventional entropic-coding methods such as Huffman and arithmetic coding break the data structure, rendering them unsuitable for learning without decoding. In this paper, we propose an alternative approach in which entropic coding is realized with Low-Density Parity-Check (LDPC) codes. We hypothesize that deep-learning models can more effectively exploit the internal code structure of LDPC codes. At the receiver, we leverage a specific class of Recurrent Neural Networks (RNNs), specifically the Gated Recurrent Unit (GRU), trained for image classification. Our numerical results indicate that classification based on LDPC-coded bit-planes surpasses Huffman and arithmetic coding while necessitating a significantly smaller learning model. This demonstrates the efficiency of classifying directly from LDPC-coded data, eliminating the need for any form of decompression, even partial, prior to applying the learning model.
https://arxiv.org/abs/2403.10202
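The bit-plane representation the classifier consumes can be sketched as follows; the LDPC encoding of each plane and the GRU itself are omitted here, so this only shows how an 8-bit image decomposes into the binary planes that get coded:

```python
import numpy as np

def bit_planes(img, n_bits=8):
    # Decompose an 8-bit image into its binary bit-planes; in the
    # paper, each plane is then LDPC-encoded and the GRU classifies
    # the coded bits directly, with no decoding step.
    img = img.astype(np.uint8)
    return np.stack([(img >> b) & 1 for b in range(n_bits)])

x = np.array([[200, 3]], dtype=np.uint8)
planes = bit_planes(x)   # shape (8, 1, 2)
```

Stacking the planes from least to most significant bit keeps the decomposition invertible, which the test below checks.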
Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. For more efficient ViTs, recent works lessen the quadratic cost of the self-attention layer by pruning or fusing redundant tokens. However, these works face a speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose Multi-criteria Token Fusion (MCTF), which gradually fuses tokens based on multiple criteria (e.g., similarity, informativeness, and size of fused tokens). Further, we utilize one-step-ahead attention, an improved approach to capturing the informativeness of tokens. By training the model equipped with MCTF using a token reduction consistency, we achieve the best speed-accuracy trade-off in image classification (ImageNet1K). Experimental results prove that MCTF consistently surpasses previous reduction methods, both with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving performance (+0.5% and +0.3%, respectively) over the base model. We also demonstrate the applicability of MCTF to various Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least a 31% speedup without performance degradation. Code is available at this https URL.
https://arxiv.org/abs/2403.10030
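A single-criterion toy version of token fusion: merge the most cosine-similar pair, size-weighted so that a fused token remains the mean of every original token it represents. MCTF combines several such criteria (similarity, informativeness, size); this sketch uses similarity and size only:

```python
import numpy as np

def fuse_most_similar(tokens, sizes):
    # Find the most cosine-similar token pair and merge it; the size
    # weights keep the fused token equal to the (weighted) mean of
    # all the original tokens it now stands for.
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = t @ t.T
    np.fill_diagonal(sim, -np.inf)          # never merge a token with itself
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    w = sizes[i] + sizes[j]
    merged = (sizes[i] * tokens[i] + sizes[j] * tokens[j]) / w
    keep = [k for k in range(len(tokens)) if k not in (i, j)]
    return np.vstack([tokens[keep], merged]), np.append(sizes[keep], w)

toks = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
sizes = np.ones(3)
new_toks, new_sizes = fuse_most_similar(toks, sizes)  # 3 tokens -> 2
```

Applied once per block, this gradually shortens the token sequence, which is where the FLOP savings come from.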
This paper proposes an optimization of an existing Deep Neural Network (DNN) that improves its hardware utilization and facilitates on-device training for resource-constrained edge environments. We implement efficient parameter-reduction strategies on Xception that shrink the model size without sacrificing accuracy, thus decreasing memory utilization during training. We evaluate our model in two experiments, Caltech-101 image classification and PCB defect detection, and compare its performance against the original Xception and the lightweight models EfficientNetV2B1 and MobileNetV2. The results of the Caltech-101 image classification show that our model has a better test accuracy (76.21%) than Xception (75.89%), uses less memory on average (847.9MB) than Xception (874.6MB), and has faster training and inference times. The lightweight models overfit, with EfficientNetV2B1 reaching a 30.52% test accuracy and MobileNetV2 a 58.11% test accuracy. Both lightweight models have better memory usage than our model and Xception. On PCB defect detection, our model has the best test accuracy (90.30%), compared to Xception (88.10%), EfficientNetV2B1 (55.25%), and MobileNetV2 (50.50%). MobileNetV2 has the lowest average memory usage (849.4MB), followed by our model (865.8MB), then EfficientNetV2B1 (874.8MB); Xception has the highest (893.6MB). We further experiment with pre-trained weights and observe that memory usage decreases, showing the benefits of transfer learning. A Pareto analysis of the models' performance shows that our optimized model architecture satisfies the accuracy and low-memory-utilization objectives.
https://arxiv.org/abs/2403.10569
In spite of their huge success, transformer models remain difficult to scale in depth. In this work, we develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model. Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores. We also propose DeepScaleLM, an initialization and scaling scheme that conserves unit output/gradient moments throughout the model, enabling the training of very deep models with 100s of layers. We find that transformer models could be much deeper - our deep models with fewer parameters outperform shallow models in Language Modeling, Speech Translation, and Image Classification, across Encoder-only, Decoder-only and Encoder-Decoder variants, for both Pre-LN and Post-LN transformers, for multiple datasets and model sizes. These improvements also translate into improved performance on downstream Question Answering tasks and improved robustness for image classification.
https://arxiv.org/abs/2403.09635
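The moment-conservation idea can be illustrated with a toy residual stack: scaling each residual branch by 1/sqrt(2N) for N blocks keeps the output variance O(1) with depth, since variance grows by roughly (1 + 1/(2N)) per block and compounds to about e^(1/2). The actual DeepScaleLM formulae cover attention, FFN, and both LN placements in far more detail; this is only the flavor of the argument:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W, beta):
    # Toy residual update x + beta * (x @ W). With unit-variance
    # branch output, Var grows by a factor (1 + beta**2) per block.
    return x + beta * (x @ W)

N, d = 64, 128
x = rng.normal(size=(1000, d))          # unit-variance input
for _ in range(N):
    W = rng.normal(size=(d, d)) / np.sqrt(d)   # variance-preserving branch
    x = residual_block(x, W, beta=1 / np.sqrt(2 * N))
# Output variance stays O(1) (about e^0.5 ~ 1.65) instead of
# growing linearly with depth as it would with beta = 1.
```

Without the 1/sqrt(2N) factor, the same stack would end with variance near 1 + N, i.e. depth-dependent signal explosion.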
Utilizing potent representations of the large vision-language models (VLMs) to accomplish various downstream tasks has attracted increasing attention. Within this research field, soft prompt learning has become a representative approach for efficiently adapting VLMs, such as CLIP, to tasks like image classification. However, most existing prompt learning methods learn text tokens that are unexplainable, which cannot satisfy the stringent interpretability requirements of Explainable Artificial Intelligence (XAI) in high-stakes scenarios like healthcare. To address this issue, we propose a novel explainable prompt learning framework that leverages medical knowledge by aligning the semantics of images, learnable prompts, and clinical concept-driven prompts at multiple granularities. Moreover, our framework addresses the lack of valuable concept annotations by eliciting knowledge from large language models and offers both visual and textual explanations for the prompts. Extensive experiments and explainability analyses conducted on various datasets, with and without concept labels, demonstrate that our method simultaneously achieves superior diagnostic performance, flexibility, and interpretability, shedding light on the effectiveness of foundation models in facilitating XAI. The code will be made publicly available.
https://arxiv.org/abs/2403.09410
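As a rough illustration of how concept-driven prompts yield textual explanations: score an image embedding against embeddings of clinical-concept prompts and read the per-concept similarities as an explanation. All names and vectors below are hypothetical toy values; the paper's actual method aligns images, learnable prompts, and concept prompts at multiple granularities.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def concept_scores(image_emb, concept_embs):
    """Map each clinical-concept prompt to its similarity with the image.

    The resulting scores are directly readable as a textual explanation
    ("which concepts fired"), unlike opaque learned soft-prompt tokens.
    """
    return {name: cosine(image_emb, emb) for name, emb in concept_embs.items()}

# Hypothetical 2-D embeddings for two dermatology concepts.
concepts = {"asymmetric shape": [0.9, 0.1], "regular border": [0.1, 0.9]}
scores = concept_scores([0.8, 0.2], concepts)
```

The highest-scoring concepts form a human-readable rationale for the downstream classification decision.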
Medical data often exhibits distribution shifts, which cause test-time performance degradation for deep learning models trained using standard supervised learning pipelines. This challenge is addressed in the field of Domain Generalization (DG); its sub-field of Single Domain Generalization (SDG) is of particular interest due to the privacy- and logistics-related issues often associated with medical data. Existing disentanglement-based SDG methods rely heavily on the structural information embedded in segmentation masks; classification labels, however, do not provide such dense information. This work introduces a novel SDG method for medical image classification that leverages channel-wise contrastive disentanglement, further enhanced with reconstruction-based style regularization to ensure the extraction of distinct style and structure feature representations. We evaluate our method on the complex task of multicenter histopathology image classification, comparing it against state-of-the-art (SOTA) SDG baselines. Results demonstrate that our method surpasses the SOTA by a margin of 1% in average accuracy while also showing more stable performance. This study highlights the importance and challenges of exploring SDG frameworks in the context of the classification task. The code is publicly available at this https URL
https://arxiv.org/abs/2403.09400
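The channel-wise disentanglement described above can be caricatured as: split the feature channels into a structure part and a style part, then apply a contrastive objective so the structure features of two style-augmented views of the same image agree. A minimal InfoNCE-style sketch under that assumption (the split ratio, features, and loss details are illustrative, not the paper's):

```python
import math

def split_channels(feat, structure_ratio=0.5):
    """Split a channel vector into (structure, style) halves."""
    k = int(len(feat) * structure_ratio)
    return feat[:k], feat[k:]

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: pull the anchor toward the positive, away from negatives."""
    def sim(u, v):
        return sum(a * b for a, b in zip(u, v)) / temperature
    logits = [sim(anchor, positive)] + [sim(anchor, n) for n in negatives]
    m = max(logits)  # subtract max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

# Structure channels of two augmented views should match; style channels differ.
view_a = [1.0, 0.0, 0.3, 0.7]   # hypothetical 4-channel feature
view_b = [1.0, 0.0, 0.9, 0.1]   # same structure, different style
struct_a, _ = split_channels(view_a)
struct_b, _ = split_channels(view_b)
loss_aligned = info_nce(struct_a, struct_b, negatives=[[0.0, 1.0]])
loss_misaligned = info_nce(struct_a, [0.0, 1.0], negatives=[struct_b])
```

Minimizing this loss on the structure channels, while style is handled separately (e.g. via the reconstruction-based regularization the abstract mentions), is the disentanglement pressure in spirit.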
The CLIP (Contrastive Language-Image Pretraining) model has exhibited outstanding performance in recognition problems, such as zero-shot image classification and object detection. However, its ability to count remains understudied due to the inherent challenges of transforming counting--a regression task--into a recognition task. In this paper, we investigate CLIP's potential in counting, focusing specifically on estimating crowd sizes. Existing classification-based crowd-counting methods have encountered issues, including inappropriate discretization strategies, which impede the application of CLIP and result in suboptimal performance. To address these challenges, we propose the Enhanced Blockwise Classification (EBC) framework. In contrast to previous methods, EBC relies on integer-valued bins that facilitate the learning of robust decision boundaries. Within our model-agnostic EBC framework, we introduce CLIP-EBC, the first fully CLIP-based crowd-counting model capable of generating density maps. Comprehensive evaluations across diverse crowd-counting datasets demonstrate the state-of-the-art performance of our methods. Particularly, EBC can improve existing models by up to 76.9%. Moreover, our CLIP-EBC model surpasses current crowd-counting methods, achieving mean absolute errors of 55.0 and 6.3 on ShanghaiTech part A and part B datasets, respectively. The code will be made publicly available.
https://arxiv.org/abs/2403.09281
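The integer-valued bins of EBC can be sketched as follows: each image block's local count is assigned to one of the bins {0, 1, ..., B-1, B+}, and at inference the block count is recovered as the expectation over the classifier's bin probabilities; summing the block expectations yields the image-level count. Bin layout and probability values here are illustrative, not the paper's exact configuration.

```python
def count_to_bin(count: int, max_bin: int) -> int:
    """Assign an integer block count to a bin; counts >= max_bin share the last bin."""
    return min(count, max_bin)

def expected_block_count(probs, bin_values):
    """Posterior-mean count for one block, given classifier bin probabilities."""
    return sum(p * v for p, v in zip(probs, bin_values))

def image_count(block_probs, bin_values):
    """Sum per-block expectations into a density-map-style image total."""
    return sum(expected_block_count(p, bin_values) for p in block_probs)

bin_values = [0, 1, 2, 3]  # integer bins; the last bin stands in for "3 or more"
blocks = [
    [0.7, 0.2, 0.1, 0.0],  # mostly empty block
    [0.1, 0.6, 0.2, 0.1],  # about one person
]
total = image_count(blocks, bin_values)
```

Because the bin boundaries sit on integers, adjacent classes differ by a full count, which is what the abstract credits for the more robust decision boundaries compared with fractional discretization schemes.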