Abstract
Evaluating free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to gain insights into how LLMs evaluate explanations. We observed that replacing one of the human ratings sometimes maintained, but more often lowered, the inter-annotator agreement across different settings and quality aspects, suggesting that LLM judgments are not always consistent with those of human raters. We further quantified this difference by comparing the correlation between LLM-generated ratings and majority-voted human ratings across different quality aspects. With the best system, Spearman's rank correlation ranged from 0.53 to 0.95, averaging 0.72 across aspects, indicating moderately high but imperfect alignment. Finally, we considered the alternative of using an LLM as an additional rater when human raters are scarce, and measured the correlation between the original gold labels and majority-voted labels produced by a limited human pool with an LLM as an additional rater. While GPT-4 improved the outcome when there were only two human raters, in all other observed cases, with three or more human raters, LLMs were neutral to detrimental. We publicly release the dataset to support future improvements in LLM-in-the-loop evaluation here: this https URL.
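The two measurements described above, majority-voting over per-item human ratings and Spearman's rank correlation against a second set of ratings, can be sketched in plain Python. This is a minimal illustrative sketch, not the paper's actual pipeline; the function names, the tie-breaking rule in the vote, and the toy ratings are assumptions for demonstration only.

```python
from collections import Counter
from statistics import mean

def average_ranks(values):
    """Assign 1-based ranks, averaging ranks within tie groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # extend j to cover the whole group of equal values
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def majority_vote(ratings_per_item):
    """Most common rating per item; ties broken toward the lower rating
    (an illustrative choice, not necessarily the paper's rule)."""
    return [min(Counter(r).most_common(), key=lambda t: (-t[1], t[0]))[0]
            for r in ratings_per_item]

# Toy example: three items, each rated by three humans on a Likert scale.
human = [[3, 3, 2], [1, 2, 1], [5, 4, 5]]
gold = majority_vote(human)          # [3, 1, 5]
llm = [3, 2, 4]                      # hypothetical LLM ratings
rho = spearman_rho(gold, llm)        # 1.0: same ranking, different values
```

In practice one would likely use `scipy.stats.spearmanr`, which handles ties the same way; the point here is only to make the aggregation-then-correlation procedure concrete.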
URL
https://arxiv.org/abs/2405.04818