Abstract
With the increasing application of large language models (LLMs) in the medical domain, evaluating these models' performance on benchmark datasets has become crucial. This paper presents a comprehensive survey of the benchmark datasets employed in medical LLM tasks. These datasets cover text, image, and multimodal benchmarks, and address different kinds of medical data and tasks, such as electronic health records (EHRs), doctor-patient dialogues, medical question answering, and medical image captioning. The survey categorizes the datasets by modality and discusses their significance, data structure, and impact on the development of LLMs for clinical tasks such as diagnosis, report generation, and predictive decision support. Key benchmarks include MIMIC-III, MIMIC-IV, BioASQ, PubMedQA, and CheXpert, which have facilitated advances in tasks like medical report generation, clinical summarization, and synthetic data generation. The paper summarizes the challenges and opportunities in leveraging these benchmarks to advance multimodal medical intelligence, emphasizing the need for datasets with greater language diversity, structured omics data, and innovative approaches to synthesis. This work also provides a foundation for future research on the application of LLMs in medicine, contributing to the evolving field of medical artificial intelligence.
URL
https://arxiv.org/abs/2410.21348