Abstract
Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.
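As a rough, illustrative sketch of the multi-metric idea described above (not the authors' implementation), the Python fragment below assumes a hypothetical AnswerEvaluation container with four fields matching the paper's dimensions, and shows how the scores might be reported side by side rather than collapsed into a single number.

from dataclasses import dataclass

@dataclass
class AnswerEvaluation:
    # Each dimension is a score in [0, 1]; the names and fields are illustrative assumptions.
    specificity: float          # density of fine-grained, decision-critical detail
    robustness: float           # stability under paraphrasing and semantic perturbation
    answer_relevance: float     # how directly the response addresses the question
    context_utilization: float  # how much of the retrieved context the response uses

    def report(self) -> dict:
        # Report all four dimensions together; no single score stands in for overall quality.
        return {
            "specificity": self.specificity,
            "robustness": self.robustness,
            "answer_relevance": self.answer_relevance,
            "context_utilization": self.context_utilization,
        }

# Example: a response that is on-topic but vague and largely ignores the retrieved context.
print(AnswerEvaluation(specificity=0.35, robustness=0.80,
                       answer_relevance=0.90, context_utilization=0.40).report())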
URL
https://arxiv.org/abs/2602.10017