Paper Reading AI Learner

Determining Energy Efficiency Sweet Spots in Production LLM Inference

2026-02-05 14:21:00
Hiari Pizzini Cavagna, Andrea Proia, Giacomo Madella, Giovanni B. Esposito, Francesco Antici, Daniele Cesarini, Zeynep Kiziltan, Andrea Bartolini

Abstract

Large Language Model (LLM) inference is central to modern AI applications, making it critical to understand its energy footprint. Existing approaches typically estimate energy consumption as a simple linear function of input and output sequence lengths, yet our observations reveal clear energy-efficiency regimes: peak efficiency occurs with short-to-moderate inputs and medium-length outputs, while efficiency drops sharply for long inputs or very short outputs, indicating a non-linear dependency. In this work, we propose an analytical model, derived from the computational and memory-access complexity of the Transformer architecture, that accurately characterizes the efficiency curve as a function of input and output lengths. To assess its accuracy, we measure energy consumption with TensorRT-LLM on NVIDIA H100 GPUs across a diverse set of LLMs ranging from 1B to 9B parameters, including OPT, LLaMA, Gemma, Falcon, Qwen2, and Granite, tested over input and output lengths from 64 to 4096 tokens, achieving a mean MAPE of 1.79% across models. Our results show that aligning sequence lengths with these efficiency "Sweet Spots" can substantially reduce energy usage, supporting informed truncation, summarization, and adaptive generation strategies in production systems.
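The non-linear efficiency regimes the abstract describes can be illustrated with a minimal sketch. The functional form and all constants below are illustrative assumptions, not the paper's fitted model: prefill cost is taken as quadratic in input length (attention over the prompt), per-token decode cost grows with context length (KV-cache reads and attention over generated tokens), and a fixed per-request overhead is added. Efficiency — generated tokens per joule — then has an interior peak rather than growing monotonically.

```python
# Hedged sketch (NOT the paper's fitted model): a toy non-linear energy model
# in the spirit the abstract describes. The functional form and all constants
# are illustrative assumptions.
#
#   E(L_in, L_out) = a*L_in^2            # prefill attention over the prompt
#                  + (b + c*L_in)*L_out  # per-token decode + KV-cache reads
#                  + e*L_out^2           # decode attention over generated tokens
#                  + d                   # fixed per-request overhead

def energy_joules(l_in: int, l_out: int,
                  a: float = 2e-6, b: float = 1e-3,
                  c: float = 5e-4, e: float = 1e-5,
                  d: float = 0.5) -> float:
    """Toy energy estimate in joules for a single inference request."""
    return a * l_in**2 + (b + c * l_in) * l_out + e * l_out**2 + d

def efficiency(l_in: int, l_out: int) -> float:
    """Generated tokens per joule; its peak is the 'sweet spot'."""
    return l_out / energy_joules(l_in, l_out)

# Sweep the same length grid as the paper and pick the most efficient pair.
lengths = [64, 256, 1024, 4096]
sweet_spot = max(((li, lo) for li in lengths for lo in lengths),
                 key=lambda p: efficiency(*p))
# With these toy constants the peak lands at a short prompt and a
# medium-length output, and efficiency collapses for 4096-token inputs,
# qualitatively matching the regimes the abstract reports.
```

Because the denominator contains both a constant term and a term quadratic in output length, tokens-per-joule rises, peaks at a medium output length, and then falls again — the "sweet spot" shape; a purely linear cost model cannot reproduce this.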


URL

https://arxiv.org/abs/2602.05695

PDF

https://arxiv.org/pdf/2602.05695.pdf

