Abstract
This research evaluates the non-commercial open-source large language models (LLMs) Meditron, MedAlpaca, Mistral, and Llama-2 for their efficacy in interpreting medical guidelines stored in PDF format. As a test scenario, we applied these models to the European Society of Cardiology (ESC) guidelines for hypertension in children and adolescents. Using Streamlit, a Python library, we developed a user-friendly medical document chatbot tool (MedDoc-Bot) that lets authorized users upload PDF files and pose questions, generating interpretive responses from the four locally stored LLMs. A pediatric expert established an evaluation benchmark by formulating questions and reference responses drawn from the ESC guidelines, and rated each model-generated response for fidelity and relevance. We also computed METEOR and chrF scores to assess the similarity of model responses to the reference answers. Llama-2 and Mistral performed well in the metric evaluation; however, Llama-2 was slower when handling text and tabular data. In the human evaluation, responses generated by Mistral, Meditron, and Llama-2 showed reasonable fidelity and relevance. This study provides valuable insights into the strengths and limitations of LLMs for future work on medical document interpretation. Open-Source Code: this https URL
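The abstract does not specify which implementation of the similarity metrics was used; standard libraries such as sacrebleu (chrF) and NLTK (METEOR) are common choices. As a rough illustration of what chrF measures, here is a minimal pure-Python sketch of a character n-gram F-score: it counts overlapping character n-grams (n = 1..6) between a model response and a reference answer and combines average precision and recall into an F-beta score with beta = 2. This is a simplified approximation, not the exact scoring used in the paper (it ignores whitespace entirely and omits the word n-gram component of chrF++).

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (as chrF does by default)."""
    text = "".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F_beta over averaged character n-gram precision/recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    if p + r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

An identical response and reference yields a score of 1.0, fully disjoint texts yield 0.0, and partially overlapping answers fall in between; recall is weighted more heavily than precision (beta = 2), so a response that covers the reference content scores higher than a terse one.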
URL
https://arxiv.org/abs/2405.03359