Paper Reading AI Learner

The Generative Energy Arena: Incorporating Energy Awareness in Large Language Model Human Evaluations

2025-07-17 17:11:14
Carlos Arriaga, Gonzalo Martínez, Eneko Sendin, Javier Conde, Pedro Reviriego

Abstract

The evaluation of large language models (LLMs) is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs answer multiple-choice questions on different topics. However, this method has limitations, the most concerning being its poor correlation with human judgments. An alternative is to have humans evaluate the LLMs. This poses scalability issues: the number of models to evaluate is large and growing, making traditional studies based on recruiting a group of evaluators and having them rank model responses impractical (and costly). A further alternative is the use of public arenas, such as the popular LM Arena, in which any user can freely evaluate models on any question and rank the responses of two models. The results are then aggregated into a model ranking. An increasingly important aspect of LLMs is their energy consumption, so evaluating how energy awareness influences humans' choice of model is of interest. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the models into the evaluation process. We also present preliminary results obtained with GEA, showing that for most questions, when users are aware of the energy consumption, they favor smaller, more energy-efficient models. This suggests that, for most user interactions, the extra cost and energy incurred by the more complex, top-performing models do not yield an increase in the perceived quality of the responses that justifies their use.
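The abstract does not specify how GEA aggregates pairwise votes into a ranking, but public arenas such as LM Arena commonly use Elo-style rating updates over pairwise comparisons. As a hedged illustration only (the function, model names, and K-factor below are assumptions, not from the paper), a minimal sketch of such an update:

```python
def elo_update(r_a, r_b, outcome, k=32):
    """One Elo update from a single pairwise vote.

    outcome: 1.0 if model A wins, 0.0 if model B wins, 0.5 for a tie.
    Returns the updated ratings (r_a, r_b).
    """
    # Expected score of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (outcome - expected_a)
    r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - expected_a))
    return r_a_new, r_b_new


# Hypothetical vote stream: both models start at 1000; the smaller
# model wins two of three comparisons, as energy-aware users might vote.
ratings = {"model_small": 1000.0, "model_large": 1000.0}
votes = [
    ("model_small", "model_large", 1.0),
    ("model_small", "model_large", 1.0),
    ("model_small", "model_large", 0.0),
]
for a, b, outcome in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)
```

After these votes the smaller model ends with the higher rating, which is how a stream of individual preferences becomes a leaderboard.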


URL

https://arxiv.org/abs/2507.13302

PDF

https://arxiv.org/pdf/2507.13302.pdf

