Abstract
Recent research on large language models (LLMs) has focused primarily on adapting and applying them in specialized domains. In medicine, LLM applications concentrate on tasks such as automated medical report generation, summarization, diagnostic reasoning, and question answering between doctors and patients. Being a good teacher is harder than being a good student, and this study pioneers the application of LLMs to medical education. We investigate the extent to which LLMs can generate medical qualification exam questions and corresponding answers from few-shot prompts. Using a real-world Chinese dataset of elderly chronic diseases, we tasked eight widely used LLMs (ERNIE 4, ChatGLM 4, Doubao, Hunyuan, Spark 4, Qwen, Llama 3, and Mistral) with generating open-ended questions and answers based on a sampled subset of admission reports. Medical experts then manually evaluated these questions and answers across multiple dimensions. We found that LLMs given few-shot prompts can effectively mimic real-world medical qualification exam questions, but that the correctness, evidence-based statements, and professionalism of the generated answers leave room for improvement. LLMs also show a decent ability to correct reference answers. Given the immense potential of artificial intelligence in medicine, generating questions and answers for medical qualification exams aimed at medical students, interns, and residents can be a significant focus of future research.
URL
https://arxiv.org/abs/2410.23769