Abstract
Large Language Model (LLM) inference is central to modern AI applications, making it critical to understand its energy footprint. Existing approaches typically estimate energy consumption as a simple linear function of input and output sequence lengths, yet our observations reveal clear energy-efficiency regimes: peak efficiency occurs with short-to-moderate inputs and medium-length outputs, while efficiency drops sharply for long inputs or very short outputs, indicating a non-linear dependency. In this work, we propose an analytical model, derived from the computational and memory-access complexity of the Transformer architecture, that accurately characterizes the efficiency curve as a function of input and output lengths. To assess its accuracy, we measure energy consumption with TensorRT-LLM on NVIDIA H100 GPUs across a diverse set of LLMs ranging from 1B to 9B parameters, including OPT, LLaMA, Gemma, Falcon, Qwen2, and Granite, over input and output lengths from 64 to 4096 tokens, achieving a mean absolute percentage error (MAPE) of 1.79% averaged across models. Our results show that aligning sequence lengths with these efficiency "sweet spots" can substantially reduce energy usage, supporting informed truncation, summarization, and adaptive generation strategies in production systems.
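The non-linear efficiency regimes described above can be illustrated with a toy cost model. The functional form and all coefficients below are hypothetical sketches based on standard Transformer cost analysis (compute-bound prefill with a quadratic attention term, memory-bound decode with KV-cache reads, fixed per-request overhead); they are not the paper's actual analytical model or fitted parameters.

```python
# Toy, hypothetical energy model for Transformer inference.
# Coefficients a, b, c, d are illustrative placeholders, not fitted values.

def energy_joules(n_in, n_out, a=2e-4, b=5e-8, c=3e-3, d=0.5):
    # Prefill: compute-bound, with a quadratic self-attention term in input length.
    prefill = a * n_in + b * n_in ** 2
    # Decode: roughly constant per generated token, plus KV-cache reads that
    # grow with the total context length at each step.
    decode = sum(c + b * (n_in + t) for t in range(n_out))
    # Fixed per-request overhead (scheduling, kernel launch) amortized per call.
    return d + prefill + decode

def tokens_per_joule(n_in, n_out):
    # Efficiency metric: generated tokens per joule of energy.
    return n_out / energy_joules(n_in, n_out)
```

Even this crude sketch reproduces the qualitative "sweet spot" behavior: very short outputs amortize the fixed overhead poorly, and very long inputs pay quadratic prefill and growing KV-cache costs, so efficiency peaks at moderate lengths rather than scaling linearly.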
URL
https://arxiv.org/abs/2602.05695