DinoBloom: A Foundation Model for Generalizable Cell Embeddings in Hematology

2024-04-07 17:25:52
Valentin Koch, Sophia J. Wagner, Salome Kazeminia, Ece Sancar, Matthias Hehr, Julia Schnabel, Tingying Peng, Carsten Marr


In hematology, computational models offer significant potential to improve diagnostic accuracy, streamline workflows, and reduce the tedious work of analyzing single cells in peripheral blood or bone marrow smears. However, their clinical adoption has been hampered by poor generalization, which stems from large batch effects, small dataset sizes, and poor transfer-learning performance from natural images. To address these challenges, we introduce DinoBloom, the first foundation model for single cell images in hematology, utilizing a tailored DINOv2 pipeline. Our model is built upon an extensive collection of 13 diverse, publicly available datasets of peripheral blood and bone marrow smears, the most substantial open-source cohort in hematology so far, comprising over 380,000 white blood cell images. To assess its generalization capability, we evaluate it on an external dataset with a challenging domain shift. We show that our model outperforms existing medical and non-medical vision models in (i) linear probing and k-nearest neighbor evaluations for cell-type classification on blood and bone marrow smears and (ii) weakly supervised multiple instance learning for acute myeloid leukemia subtyping by a large margin. A family of four DinoBloom models (small, base, large, and giant) can be adapted for a wide range of downstream applications, serve as a strong baseline for classification problems, and facilitate the assessment of batch effects in new datasets. All models are available at this http URL.
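The k-nearest neighbor evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes cell embeddings have already been extracted by a frozen encoder and are available as plain vectors. Everything here is hypothetical and for illustration only: the three-dimensional toy embeddings, the class names, the cosine-similarity metric, and the helper names (`knn_predict`, `knn_accuracy`, `make_toy_embeddings`) are not taken from the paper or from the DinoBloom code.

```python
import math
import random
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, train_embeddings, train_labels, k=5):
    """Predict a label by majority vote among the k most similar training embeddings."""
    scored = [
        (cosine_similarity(query, emb), label)
        for emb, label in zip(train_embeddings, train_labels)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_x, train_y, test_x, test_y, k=5):
    """Fraction of test embeddings whose k-NN prediction matches the true label."""
    correct = sum(
        knn_predict(x, train_x, train_y, k) == y for x, y in zip(test_x, test_y)
    )
    return correct / len(test_y)


def make_toy_embeddings(n_per_class, centers, noise, rng):
    """Hypothetical stand-in for encoder outputs: Gaussian noise around class centers."""
    xs, ys = [], []
    for label, center in centers.items():
        for _ in range(n_per_class):
            xs.append([c + rng.gauss(0.0, noise) for c in center])
            ys.append(label)
    return xs, ys


# Assumption: three illustrative cell types with well-separated toy embeddings.
rng = random.Random(0)
centers = {
    "lymphocyte": [1.0, 0.0, 0.0],
    "monocyte": [0.0, 1.0, 0.0],
    "neutrophil": [0.0, 0.0, 1.0],
}
train_x, train_y = make_toy_embeddings(30, centers, 0.1, rng)
test_x, test_y = make_toy_embeddings(10, centers, 0.1, rng)

accuracy = knn_accuracy(train_x, train_y, test_x, test_y, k=5)
print(f"k-NN accuracy on toy embeddings: {accuracy:.2f}")
```

Because the encoder stays frozen and the classifier has no trainable parameters, this kind of evaluation measures how well the embedding space alone separates cell types. In practice the toy vectors would be replaced by embeddings from the model under test.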




