Abstract
As generative artificial intelligence (AI), particularly Large Language Models (LLMs), continues to permeate healthcare, it remains crucial to supplement traditional automated evaluations with human expert evaluation. Understanding and evaluating the generated texts is vital for ensuring safety, reliability, and effectiveness. However, the cumbersome, time-consuming, and non-standardized nature of human evaluation presents significant obstacles to the widespread adoption of LLMs in practice. This study reviews the existing literature on human evaluation methodologies for LLMs in healthcare and highlights a notable need for a standardized, consistent human evaluation approach. Our extensive literature search, adhering to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, spans publications from January 2018 to February 2024. This review provides a comprehensive overview of the human evaluation approaches used in diverse healthcare applications. Our analysis examines the human evaluation of LLMs across various medical specialties, addressing factors such as evaluation dimensions, sample types and sizes, the selection and recruitment of evaluators, frameworks and metrics, the evaluation process, and statistical analysis of the results. Drawing on the diverse evaluation strategies highlighted in these studies, we propose a comprehensive and practical framework for human evaluation of generative LLMs, named QUEST: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence. By defining clear evaluation dimensions and offering detailed guidelines, this framework aims to improve the reliability, generalizability, and applicability of human evaluation of generative LLMs across healthcare applications.
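To make the abstract's mention of structured evaluation dimensions and statistical analysis concrete, the sketch below shows how a QUEST-style rubric might be represented and how agreement between two human evaluators on one dimension could be quantified with Cohen's kappa. This is an illustrative assumption, not the paper's methodology: the 1-5 Likert scale, the two-rater setup, and the sample scores are all hypothetical.

```python
from collections import Counter

# The five QUEST dimensions named in the abstract.
QUEST_DIMENSIONS = [
    "Quality of Information",
    "Understanding and Reasoning",
    "Expression Style and Persona",
    "Safety and Harm",
    "Trust and Confidence",
]

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters'
    categorical (here, Likert) scores over the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater scored independently at random
    # according to their own marginal score distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 1-5 Likert scores from two clinician evaluators
# rating eight LLM responses on a single QUEST dimension.
scores_a = [5, 4, 4, 3, 5, 2, 4, 4]
scores_b = [5, 4, 3, 3, 5, 2, 4, 5]
print(round(cohen_kappa(scores_a, scores_b), 3))  # → 0.66
```

In practice, reviews of this kind often report kappa (or weighted variants for ordinal scales) per dimension to show that a rubric yields consistent judgments across evaluators.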
URL
https://arxiv.org/abs/2405.02559