Abstract
The quality and richness of the feature maps extracted by convolutional neural networks (CNNs) and vision Transformers (ViTs) directly relate to robust model performance. In medical computer vision, these information-rich features are crucial for detecting rare cases within large datasets. This work presents the "Scopeformer," a novel multi-CNN-ViT model for intracranial hemorrhage classification in computed tomography (CT) images. The Scopeformer architecture is scalable and modular, allowing various CNN architectures to serve as the backbone with diversified output features and pre-training strategies. We propose effective feature-projection methods that reduce redundancy among CNN-generated features and control the input size of the ViT. Extensive experiments with various Scopeformer models show that model performance is proportional to the number of convolutional blocks employed in the feature extractor. Using multiple strategies, including diversified pre-training paradigms for the CNNs, different pre-training datasets, and style-transfer techniques, we demonstrate overall improvements in model performance at various computational budgets. We then propose smaller, compute-efficient Scopeformer versions with three different input and output ViT configurations. Efficient Scopeformers use four different pre-trained CNN architectures as feature extractors to increase feature richness. Our best Efficient Scopeformer model achieved an accuracy of 96.94\% and a weighted logarithmic loss of 0.083 with an eight-fold reduction in the number of trainable parameters compared to the base Scopeformer. Another Efficient Scopeformer variant further reduced the parameter space by almost 17 times with negligible performance loss. Hybrid CNN-ViT models may provide the feature richness needed to develop accurate medical computer vision models.
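The pipeline the abstract describes, multiple CNN backbones whose feature maps are concatenated, projected to a fixed channel count to remove redundancy, and flattened into ViT tokens, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the backbone stubs, shapes, and the 1x1 projection weights are all hypothetical placeholders.

```python
# Hypothetical sketch of the Scopeformer feature-aggregation idea:
# several CNN backbones each emit a C x H x W feature map; the maps are
# concatenated along the channel axis, a 1x1 projection reduces the
# channel count (limiting redundancy and fixing the ViT input size),
# and each spatial position becomes one ViT token.
import random

def fake_backbone(channels, h, w, seed):
    """Stand-in for a pre-trained CNN: returns a channels x h x w map."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(w)] for _ in range(h)]
            for _ in range(channels)]

def concat_channels(maps):
    """Concatenate feature maps from several backbones along channels."""
    return [ch for m in maps for ch in m]

def project_1x1(feat, weight):
    """1x1 convolution: per-pixel linear map from c_in to c_out channels."""
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    c_out = len(weight)
    out = [[[0.0] * w for _ in range(h)] for _ in range(c_out)]
    for o in range(c_out):
        for i in range(c_in):
            wi = weight[o][i]
            for y in range(h):
                for x in range(w):
                    out[o][y][x] += wi * feat[i][y][x]
    return out

def to_tokens(feat):
    """Flatten the H x W positions into tokens of length c (ViT input)."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [[feat[ch][y][x] for ch in range(c)]
            for y in range(h) for x in range(w)]

# Four backbones (as in the Efficient Scopeformer), 8x8 spatial maps.
maps = [fake_backbone(16, 8, 8, seed=s) for s in range(4)]
fused = concat_channels(maps)               # 4 * 16 = 64 channels
rng = random.Random(0)
W = [[rng.gauss(0.0, 0.1) for _ in range(64)] for _ in range(32)]
tokens = to_tokens(project_1x1(fused, W))   # 64 tokens of dimension 32
```

The projection width (here 32) is the knob the abstract alludes to: it decouples the total channel count of the backbones from the ViT embedding size, so backbones can be added or swapped without changing the Transformer.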
URL
https://arxiv.org/abs/2302.00220