Paper Reading AI Learner

MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image Analysis

2024-04-24 01:14:33
Jiaxin Zhuang, Linshan Wu, Qiong Wang, Varut Vardhanabhuti, Lin Luo, Hao Chen

Abstract

The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis, and Masked Autoencoder (MAE) pre-training can further unleash its potential on various medical vision tasks. However, 3D medical images have large spatial sizes and much higher dimensionality than natural images, and the lack of a hierarchical design in MAE may hinder performance on downstream tasks. In this paper, we propose a novel Mask in Mask (MiM) pre-training framework for 3D medical images, which advances MAE by learning discriminative representations from hierarchical visual tokens across varying scales. We generate masked inputs from the volume at multiple levels of granularity and reconstruct them simultaneously at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied between volumes at adjacent levels to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to learn hierarchical representations efficiently during pre-training. MiM was pre-trained on a large collection of available 3D volumetric images, i.e., Computed Tomography (CT) scans covering various body parts. Extensive experiments on thirteen public datasets demonstrate the superiority of MiM over other SSL methods in organ/lesion/tumor segmentation and disease classification. We further scale up MiM to large pre-training datasets of more than 10k volumes, showing that large-scale pre-training can further enhance downstream performance. These improvements also suggest that the research community should pay more attention to the scale of pre-training datasets when building healthcare foundation models for 3D medical images.
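The abstract does not detail the masking or alignment procedure, so the following is only a minimal PyTorch sketch of how a two-level "mask in mask" scheme could nest a fine-grained random mask inside a coarse one on a 3D volume. The function name, patch sizes, and the 60% mask ratio are illustrative assumptions, not the paper's actual design.

import torch

def hierarchical_mask(volume, coarse_patch=32, fine_patch=16, mask_ratio=0.6):
    # volume: (D, H, W) tensor; dimensions are assumed divisible by
    # coarse_patch, and coarse_patch by fine_patch (simplifying assumption).
    # Returns boolean masks (True = masked) at two patch granularities.
    D, H, W = volume.shape
    gd, gh, gw = D // coarse_patch, H // coarse_patch, W // coarse_patch
    n_coarse = gd * gh * gw
    # Standard MAE-style random masking: a random permutation selects
    # exactly int(mask_ratio * n_coarse) coarse blocks to hide.
    coarse_mask = torch.rand(n_coarse).argsort() < int(mask_ratio * n_coarse)
    coarse_mask = coarse_mask.view(gd, gh, gw)
    # Fine grid nested inside the coarse one: each fine patch first inherits
    # its coarse block's flag, then extra fine patches are masked so the fine
    # level is strictly more heavily masked ("mask in mask").
    r = coarse_patch // fine_patch
    fine_mask = (coarse_mask.repeat_interleave(r, 0)
                            .repeat_interleave(r, 1)
                            .repeat_interleave(r, 2))
    extra = torch.rand(fine_mask.shape) < mask_ratio
    fine_mask = fine_mask | extra
    return coarse_mask, fine_mask

# Toy usage on a random CT-sized volume.
vol = torch.randn(96, 96, 96)
coarse, fine = hierarchical_mask(vol)
print(coarse.shape, fine.shape)  # torch.Size([3, 3, 3]) torch.Size([6, 6, 6])

Nesting the fine mask inside the coarse one guarantees that every region hidden at the coarse level is also hidden at the fine level, so reconstruction targets at the two granularities remain consistent with each other.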


URL

https://arxiv.org/abs/2404.15580

PDF

https://arxiv.org/pdf/2404.15580.pdf

