Abstract
This study addresses the de-identification of clinical reports, which is needed to give researchers access to data while protecting patient privacy. It highlights the difficulty of sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP) in implementing systematic pseudonymization of text documents from its Clinical Data Warehouse. We annotated a corpus of clinical documents with 12 types of identifying entities and built a hybrid system that merges the results of a deep learning model with manual rules. The system reaches an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort such a task involves, covering dataset size, document types, language models, and rule addition. We share guidelines and code under a 3-Clause BSD license.
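To make the idea of a hybrid system concrete, here is a minimal sketch of how spans predicted by a model might be merged with spans found by hand-written rules. It is not the authors' implementation: the `Span` type, the two regular expressions, the merge policy (longest span wins, rules win ties) and the placeholder replacement are all assumptions chosen for illustration, and the model predictions are simply passed in as a list.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str
    source: str  # "rule" or "model"


# Hypothetical rules, for illustration only (not taken from the paper).
RULES = {
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Return every span matched by one of the regex rules."""
    return [
        Span(m.start(), m.end(), label, "rule")
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def merge_spans(model: List[Span], rules: List[Span]) -> List[Span]:
    """Union of both sources; on overlap keep the longer span (assumed policy)."""
    candidates = sorted(
        model + rules,
        # Longest first, then leftmost, then rule spans before model spans.
        key=lambda s: (-(s.end - s.start), s.start, s.source != "rule"),
    )
    kept: List[Span] = []
    for s in candidates:
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept)  # tuples sort by start offset first


def pseudonymize(text: str, spans: List[Span]) -> str:
    """Replace each span with a <LABEL> placeholder (a simplification)."""
    parts, cursor = [], 0
    for s in spans:
        parts.append(text[cursor:s.start])
        parts.append(f"<{s.label}>")
        cursor = s.end
    parts.append(text[cursor:])
    return "".join(parts)
```

In this sketch the model output would come from whatever named-entity recognizer is in use; the rules mainly serve as a safety net for highly regular entities such as dates or phone numbers that the model may miss.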
URL
https://arxiv.org/abs/2303.13451