Abstract
In numerous studies, deep learning algorithms have proven their potential for analyzing histopathology images, for example by revealing tumor subtypes or the primary origin of metastases. These models require large training datasets, which must be anonymized to prevent leaks of patient identity. This study demonstrates that even relatively simple deep learning algorithms can re-identify patients in large histopathology datasets with substantial accuracy. We evaluated our algorithms on two TCIA datasets, one of lung squamous cell carcinoma (LSCC) and one of lung adenocarcinoma (LUAD), and on an in-house dataset of meningioma tissue. We predicted the source patient of a slide with F1 scores of 50.16% and 52.30% on the LSCC and LUAD datasets, respectively, and 62.31% on our meningioma dataset. Based on these findings, we formulated a risk assessment scheme to estimate the risk to patient privacy prior to publication.
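To make the reported metric concrete, the following is a minimal sketch of how an F1 score for source-patient prediction could be computed when each patient is treated as one class. The abstract does not state which averaging scheme was used; macro-averaging is an assumption here, and the function name `macro_f1` and the patient labels are hypothetical.

```python
from collections import Counter


def macro_f1(true_ids, pred_ids):
    """Macro-averaged F1, treating every patient ID as its own class.

    Assumption: macro-averaging (unweighted mean of per-patient F1).
    The abstract does not say which averaging the authors used.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_id, pred_id in zip(true_ids, pred_ids):
        if true_id == pred_id:
            tp[true_id] += 1
        else:
            fp[pred_id] += 1  # predicted patient gets a false positive
            fn[true_id] += 1  # true patient gets a false negative

    # Score every patient that appears as a true or predicted label.
    patients = set(true_ids) | set(pred_ids)
    scores = []
    for patient in patients:
        denom = 2 * tp[patient] + fp[patient] + fn[patient]
        scores.append(2 * tp[patient] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical example: six slides from three patients.
true_patients = ["A", "A", "B", "B", "C", "C"]
pred_patients = ["A", "B", "B", "B", "C", "A"]
score = macro_f1(true_patients, pred_patients)
print(f"macro F1 = {score:.4f}")
```

With many patients and only a few slides per patient, chance-level performance under this metric is close to zero, which is one way to read F1 scores around 50% as a substantial re-identification risk.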
URL
https://arxiv.org/abs/2403.12816