Paper Reading AI Learner

SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation

2026-02-10 17:39:17
Homaira Huda Shomee, Rochana Chaturvedi, Yangxinyu Xie, Tanwi Mallick

Abstract

Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.
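To make the four dimensions concrete, here is a minimal, purely illustrative sketch of a reference-free multi-metric scorer. The proxy heuristics below (lexical overlap, detail-token ratio, score spread under paraphrasing) are assumptions for illustration only, not the paper's actual metric definitions:

```python
# Hypothetical sketch of a reference-free, multi-metric answer scorer in the
# spirit of SCORE. The four proxies below are illustrative stand-ins, NOT the
# paper's actual metric definitions.
from dataclasses import dataclass
import re
import statistics


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def specificity(answer: str) -> float:
    # Proxy: fraction of tokens carrying fine-grained detail
    # (numbers or capitalized named entities).
    words = answer.split()
    if not words:
        return 0.0
    detailed = [w for w in words
                if any(c.isdigit() for c in w) or w[:1].isupper()]
    return len(detailed) / len(words)


def relevance(answer: str, question: str) -> float:
    # Proxy: lexical overlap between the answer and the question.
    q, a = _tokens(question), _tokens(answer)
    return len(q & a) / len(q) if q else 0.0


def context_utilization(answer: str, context: str) -> float:
    # Proxy: share of answer tokens grounded in the retrieved context.
    c, a = _tokens(context), _tokens(answer)
    return len(c & a) / len(a) if a else 0.0


def robustness(scores_across_paraphrases: list[float]) -> float:
    # Proxy: 1 minus the spread of a metric when the question is
    # paraphrased or semantically perturbed; stable answers score near 1.
    if len(scores_across_paraphrases) < 2:
        return 1.0
    return max(0.0, 1.0 - statistics.pstdev(scores_across_paraphrases))


@dataclass
class Score:
    specificity: float
    relevance: float
    context_utilization: float
    robustness: float

    def mean(self) -> float:
        # Unweighted mean as one possible aggregate; the paper's finding
        # is that no single number suffices, so report all four as well.
        vals = (self.specificity, self.relevance,
                self.context_utilization, self.robustness)
        return sum(vals) / len(vals)
```

In practice each proxy would be replaced by a model-based judge; the point of the sketch is the structure: four independent dimensions computed per answer, reported together rather than collapsed into one score.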

URL

https://arxiv.org/abs/2602.10017

PDF

https://arxiv.org/pdf/2602.10017.pdf

