Abstract
With the increasing use of large language models (LLMs) to generate answers to biomedical questions, it is crucial to evaluate both the quality of the generated answers and the references provided to support the facts they state. Evaluating LLM-generated text remains a challenge for question answering, retrieval-augmented generation (RAG), summarization, and many other natural language processing tasks in the biomedical domain, because verifying consistency with the scientific literature and handling complex medical terminology require expert assessment. In this work, we propose BioACE, an automated framework for evaluating biomedical answers and the citations given for the facts stated in those answers. For answer evaluation, BioACE considers multiple aspects, including completeness, correctness, precision, and recall, measured against ground-truth nuggets. We developed automated approaches to evaluate each of these aspects and performed extensive experiments to assess and analyze their correlation with human evaluations. In addition, we examined multiple existing approaches, including natural language inference (NLI), pre-trained language models, and LLMs, for evaluating the quality of the evidence provided to support generated answers in the form of citations to the biomedical literature. Based on these detailed experiments and analyses, we recommend the best-performing approaches for biomedical answer and citation evaluation as part of the BioACE evaluation package (this https URL).
URL
https://arxiv.org/abs/2602.04982