Abstract
The advent of Large Language Models (LLMs) has significantly transformed the AI landscape, enhancing machine learning and AI capabilities. Factuality is a critical concern for LLMs, as they may generate factually incorrect responses. In this paper, we propose GraphEval to evaluate an LLM's factuality using a substantially large test dataset. Specifically, the test dataset is retrieved from a large knowledge graph containing more than 10 million facts, without expensive human effort. Unlike conventional methods that evaluate LLMs based on generated responses, GraphEval streamlines the evaluation process by creating a judge model to estimate the correctness of the answers given by the LLM. Our experiments demonstrate that the judge model's factuality assessment aligns closely with the correctness of the LLM's generated outputs, while also substantially reducing evaluation costs. In addition, our findings offer valuable insights into LLM performance across different metrics and highlight the potential for future improvements in ensuring the factual integrity of LLM outputs. The code is publicly available at this https URL.
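The pipeline the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the triple-to-question template and the trivial string-matching judge are placeholders, where the actual GraphEval judge is a trained model that predicts answer correctness without grading each generated response by hand.

```python
# Hypothetical sketch of the evaluation flow described in the abstract:
# draw facts from a knowledge graph, probe the LLM with questions derived
# from them, and let a lightweight "judge" estimate correctness.
from dataclasses import dataclass


@dataclass
class Fact:
    """A single knowledge-graph triple."""
    head: str
    relation: str
    tail: str


def fact_to_question(fact: Fact) -> str:
    """Turn a (head, relation, tail) triple into a yes/no factual probe."""
    return f"Is it true that {fact.head} {fact.relation} {fact.tail}?"


def judge(question: str, answer: str) -> bool:
    """Stand-in for the judge model: a trivial string check.
    A real judge would be a classifier trained to predict whether
    the LLM's answer is factually correct."""
    return answer.strip().lower() in {"yes", "true"}


facts = [Fact("Paris", "is the capital of", "France")]
questions = [fact_to_question(f) for f in facts]
# Suppose the LLM under evaluation answered "Yes" to the first probe.
estimated_correct = judge(questions[0], "Yes")
```

Because the judge scores answers directly, the expensive step of collecting and manually verifying free-form generations is avoided, which is the cost reduction the abstract claims.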
URL
https://arxiv.org/abs/2404.00942