Paper Reading AI Learner

MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

2024-04-16 17:59:10
Liyan Tang, Philippe Laban, Greg Durrett

Abstract

Recognizing whether LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of "fact-checking" are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to LLMs to check a single response. In this work, we show how to build small models that have GPT-4-level performance at 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify pre-existing datasets into a benchmark, LLM-AggreFact, collected from recent work on fact-checking and grounding LLM generations. Our best system, MiniCheck-FT5 (770M parameters), outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.
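The pipeline the abstract describes — splitting a model response into individual claims and checking each one against the grounding document — can be sketched as follows. This is a minimal illustration, not the paper's implementation: `split_into_claims` and `score_claim` are hypothetical names, and the lexical-overlap scorer is a toy placeholder standing in for the trained checker (e.g., the 770M MiniCheck-FT5 model), which is the expensive per-claim step the paper makes cheap.

```python
# Toy sketch of sentence-level fact-checking against a grounding document.
# The real system replaces score_claim with a model call; everything here
# is illustrative scaffolding around that loop.
import re


def split_into_claims(response: str) -> list[str]:
    """Naively treat each sentence of the response as one claim to verify."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]


def score_claim(claim: str, document: str) -> float:
    """Placeholder checker: fraction of claim words found in the document.
    A trained checker (MiniCheck-style) would score entailment instead."""
    doc_words = set(re.findall(r"\w+", document.lower()))
    words = re.findall(r"\w+", claim.lower())
    if not words:
        return 0.0
    return sum(w in doc_words for w in words) / len(words)


def check_response(response: str, document: str,
                   threshold: float = 0.9) -> list[tuple[str, bool]]:
    """Label each claim in the response as supported (True) or not (False)."""
    return [(c, score_claim(c, document) >= threshold)
            for c in split_into_claims(response)]


document = "The Eiffel Tower is in Paris. It was completed in 1889."
response = "The Eiffel Tower is in Paris. It was completed in 1925."
for claim, supported in check_response(response, document):
    print(supported, claim)
```

Note that the loop makes one checker call per claim; the paper's contribution is making each such call as cheap as a single small-model forward pass while matching GPT-4 accuracy.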


URL

https://arxiv.org/abs/2404.10774

PDF

https://arxiv.org/pdf/2404.10774.pdf

