Abstract
The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mostly evaluate them with closed-ended question-answering tasks that provide answer options. However, in real clinical settings, many clinical decisions, such as treatment recommendations, involve answering open-ended questions without pre-set options. Moreover, existing studies mainly use accuracy to assess model performance. In this paper, we comprehensively benchmark diverse LLMs in healthcare to clearly understand their strengths and weaknesses. Our benchmark contains seven tasks and thirteen datasets across medical language generation, understanding, and reasoning. We conduct a detailed evaluation of sixteen existing LLMs in healthcare under both zero-shot and few-shot (i.e., 1-, 3-, and 5-shot) learning settings. We report results on five metrics (i.e., matching, faithfulness, comprehensiveness, generalizability, and robustness) that are critical for earning the trust of clinical users. We further invite medical experts to conduct a human evaluation.
URL
https://arxiv.org/abs/2405.00716