Paper Reading AI Learner

MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

2024-04-16 17:59:10
Liyan Tang, Philippe Laban, Greg Durrett

Abstract

Recognizing whether LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of "fact-checking" are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to LLMs to check a single response. In this work, we show how to build small models that have GPT-4-level performance at 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify pre-existing datasets into a benchmark, LLM-AggreFact, collected from recent work on fact-checking and grounding LLM generations. Our best system, MiniCheck-FT5 (770M parameters), outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.
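The pipeline the abstract describes — splitting a model response into individual claims and checking each one against the grounding document — can be sketched as follows. This is a minimal illustration, not the paper's implementation: `split_into_claims` and `score_claim` are hypothetical names, and the lexical-overlap scorer is a toy placeholder standing in for the trained checker (e.g., the 770M MiniCheck-FT5 model), which is the expensive per-claim step the paper makes cheap.

```python
# Toy sketch of sentence-level fact-checking against a grounding document.
# The real system replaces score_claim with a model call; everything here
# is illustrative scaffolding around that loop.
import re


def split_into_claims(response: str) -> list[str]:
    """Naively treat each sentence of the response as one claim to verify."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]


def score_claim(claim: str, document: str) -> float:
    """Placeholder checker: fraction of claim words found in the document.
    A trained checker (MiniCheck-style) would score entailment instead."""
    doc_words = set(re.findall(r"\w+", document.lower()))
    words = re.findall(r"\w+", claim.lower())
    if not words:
        return 0.0
    return sum(w in doc_words for w in words) / len(words)


def check_response(response: str, document: str,
                   threshold: float = 0.9) -> list[tuple[str, bool]]:
    """Label each claim in the response as supported (True) or not (False)."""
    return [(c, score_claim(c, document) >= threshold)
            for c in split_into_claims(response)]


document = "The Eiffel Tower is in Paris. It was completed in 1889."
response = "The Eiffel Tower is in Paris. It was completed in 1925."
for claim, supported in check_response(response, document):
    print(supported, claim)
```

Note that the loop makes one checker call per claim; the paper's contribution is making each such call as cheap as a single small-model forward pass while matching GPT-4 accuracy.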


URL

https://arxiv.org/abs/2404.10774

PDF

https://arxiv.org/pdf/2404.10774.pdf

