Abstract
Large language models (LLMs) such as ChatGPT have shown remarkable language understanding and generation capabilities. Although reference-free evaluators based on LLMs align better with human judgments than traditional reference-based evaluators, using them still poses many challenges. Reference-free evaluators are better suited to open-ended examples, which admit responses with different semantics; however, not all examples are open-ended. For closed-ended examples, which have a unique semantically correct response, a reference-free evaluator may still rate a response highly even when it contradicts the facts and the semantics of the reference. To comprehensively evaluate the reliability of LLM-based evaluators, we construct two adversarial meta-evaluation dialogue generation datasets, KdConv-ADV and DSTC7-ADV, based on KdConv and DSTC7-AVSD, respectively. Compared to previous meta-evaluation benchmarks, KdConv-ADV and DSTC7-ADV are much more challenging because they require evaluators to reasonably assess closed-ended examples with the help of external knowledge or even their own knowledge. Empirical results show that the ability of LLMs to identify unreasonable responses is insufficient, and that there are risks in using reference-free LLM-based evaluators to judge the quality of dialogue responses.
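The distinction between reference-free and reference-based evaluation can be illustrated with a minimal sketch. The prompt wording below is illustrative only (not taken from the paper), and the evaluator itself is represented abstractly; in practice each prompt would be sent to an LLM, which is exactly where the abstract's risk arises: without a reference, a closed-ended example's single correct answer never enters the prompt, so a fluent but factually wrong response can still be scored highly.

```python
# Hedged sketch: contrasting the two evaluation setups discussed in the
# abstract. The prompt templates are hypothetical, not the paper's.

def build_reference_free_prompt(context: str, response: str) -> str:
    """Score a response from the dialogue context alone (no reference)."""
    return (
        "Rate the quality of the following dialogue response on a 1-5 scale.\n"
        f"Context: {context}\n"
        f"Response: {response}\n"
        "Score:"
    )

def build_reference_based_prompt(context: str, response: str,
                                 reference: str) -> str:
    """Score a response against a gold reference answer."""
    return (
        "Rate how well the response matches the reference on a 1-5 scale.\n"
        f"Context: {context}\n"
        f"Reference: {reference}\n"
        f"Response: {response}\n"
        "Score:"
    )

# A closed-ended example: only one answer is factually correct, but the
# reference-free prompt carries no signal that the response is wrong.
ctx = "User: In what year did the Apollo 11 moon landing take place?"
bad_response = "It took place in 1972."
gold = "It took place in 1969."

free_prompt = build_reference_free_prompt(ctx, bad_response)
based_prompt = build_reference_based_prompt(ctx, bad_response, gold)
```

Only the reference-based prompt exposes the contradiction between the response and the gold answer; a reference-free evaluator must rely on external or internal knowledge to catch it.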
URL
https://arxiv.org/abs/2305.14658