Abstract
While Machine Learning (ML) and Deep Learning (DL) models have been widely used for diabetes prediction, the use of Large Language Models (LLMs) on structured numerical data remains underexplored. In this study, we evaluate how effectively LLMs predict diabetes under zero-shot, one-shot, and three-shot prompting. We conduct an empirical analysis on the Pima Indian Diabetes Database (PIDD). We evaluate six LLMs: four open-source models (Gemma-2-27B, Mistral-7B, Llama-3.1-8B, and Llama-3.2-2B) and two proprietary models (GPT-4o and Gemini Flash 2.0). We also compare their performance with three traditional machine learning models: Random Forest, Logistic Regression, and Support Vector Machine (SVM). We use accuracy, precision, recall, and F1-score as evaluation metrics. Our results show that proprietary LLMs generally perform better than open-source ones; GPT-4o and the open-source Gemma-2-27B achieve the highest accuracy in few-shot settings. Notably, Gemma-2-27B also outperforms the traditional ML models in terms of F1-score. However, challenges remain, including performance variation across prompting strategies and the need for domain-specific fine-tuning. This study shows that LLMs can be useful for medical prediction tasks and encourages future work on prompt engineering and hybrid approaches to improve healthcare predictions.
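To make the setup concrete, the following minimal sketch shows one way a tabular patient record could be serialized into a zero-shot or few-shot prompt, and how the four reported metrics are computed from binary predictions. It is an illustration only, not the authors' prompts or code: the prompt wording, the function names (`serialize`, `build_prompt`, `binary_metrics`), and the patient values are hypothetical, and the feature names are assumed to match the column names of the commonly distributed CSV version of the dataset.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, float]


def serialize(record: Record) -> str:
    """Turn one row of structured data into a 'name: value' text line."""
    return ", ".join(f"{name}: {value}" for name, value in record.items())


def build_prompt(query: Record, examples: List[Tuple[Record, int]]) -> str:
    """Build a k-shot prompt: k = len(examples), so an empty list is zero-shot.

    The instruction wording is a hypothetical placeholder.
    """
    lines = [
        "Predict whether the patient has diabetes. "
        "Reply with 1 (diabetic) or 0 (non-diabetic)."
    ]
    for record, label in examples:  # labelled demonstrations
        lines.append(f"Patient: {serialize(record)}")
        lines.append(f"Answer: {label}")
    lines.append(f"Patient: {serialize(query)}")  # the case to classify
    lines.append("Answer:")
    return "\n".join(lines)


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical values for illustration; not rows taken from the dataset.
    demo = ({"Glucose": 150.0, "BMI": 34.0, "Age": 52.0}, 1)
    query = {"Glucose": 95.0, "BMI": 24.5, "Age": 29.0}
    print(build_prompt(query, []))      # zero-shot
    print(build_prompt(query, [demo]))  # one-shot
```

Passing an empty list, one demonstration, or three demonstrations to `build_prompt` corresponds to the zero-shot, one-shot, and three-shot settings. The model's reply would still need to be parsed into a 0/1 label before calling `binary_metrics`.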
URL
https://arxiv.org/abs/2506.14949