Abstract
We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) that requires neither ground-truth annotations nor coordination among evaluators. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring. With task-specific prompts spanning best-candidate selection, summarization, image captioning, and dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show that it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
URL
https://arxiv.org/abs/2602.09624