Abstract
The evaluation of large language models is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs answer multiple-choice questions on different topics. However, this method has certain limitations, the most concerning being its poor correlation with human judgments. An alternative approach is to have humans evaluate the LLMs. This poses scalability issues: there is a large and growing number of models to evaluate, making it impractical (and costly) to run traditional studies that recruit a pool of evaluators and have them rank the responses of the models. A further alternative is the use of public arenas, such as the popular LM arena, in which any user can freely evaluate models on any question by ranking the responses of two models. The results are then aggregated into a model ranking. An increasingly important aspect of LLMs is their energy consumption, and therefore it is of interest to evaluate how energy awareness influences the decisions of humans when selecting a model. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the model into the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy-efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.
URL
https://arxiv.org/abs/2507.13302