VeriFastScore: Speeding up long-form factuality evaluation

2025-05-22 17:51:25
Rishanth Rajendhran, Amir Zadeh, Matthew Sarte, Chuan Li, Mohit Iyyer

Abstract

Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.
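The abstract reports agreement with VeriScore at two granularities: example level and system level. The sketch below is one plausible reading of that distinction, not the authors' evaluation code. It assumes each response already has a reference score (from the slow pipeline) and a fast score, that example-level correlation is Pearson's r over all responses, and that system-level correlation is Pearson's r over per-system mean scores. The function names, the record layout, and the toy numbers are all hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def example_and_system_correlation(records):
    """records: iterable of (system_name, reference_score, fast_score).

    Returns (example_r, system_r). The aggregation by mean is an
    assumption made for illustration.
    """
    records = list(records)

    # Example level: correlate the two metrics over individual responses.
    example_r = pearson([r[1] for r in records], [r[2] for r in records])

    # System level: average each metric per system, then correlate the means.
    by_system = defaultdict(list)
    for system, ref, fast in records:
        by_system[system].append((ref, fast))
    ref_means = [mean(p[0] for p in pairs) for pairs in by_system.values()]
    fast_means = [mean(p[1] for p in pairs) for pairs in by_system.values()]
    system_r = pearson(ref_means, fast_means)

    return example_r, system_r


# Toy, made-up scores for three hypothetical systems (not data from the paper).
toy_records = [
    ("A", 0.2, 0.3), ("A", 0.4, 0.3),
    ("B", 0.8, 0.7), ("B", 0.6, 0.9),
    ("C", 0.5, 0.5), ("C", 0.5, 0.6),
]
example_r, system_r = example_and_system_correlation(toy_records)
```

Averaging per system smooths out per-response noise, which is one reason a system-level correlation can be higher than the example-level one, as in the figures the abstract reports.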

URL

https://arxiv.org/abs/2505.16973

PDF

https://arxiv.org/pdf/2505.16973.pdf

