Paper Reading AI Learner

Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering

2026-01-20 20:52:14
Mohamad Salim, Jasmine Latendresse, SayedHassan Khatoonabadi, Emad Shihab

Abstract

LLM-based Multi-Agent (LLM-MA) systems are increasingly applied to automate complex software engineering tasks such as requirements engineering, code generation, and testing. However, their operational efficiency and resource consumption remain poorly understood, hindering practical adoption due to unpredictable costs and environmental impact. To address this, we analyze token consumption patterns in an LLM-MA system across the Software Development Life Cycle (SDLC), aiming to understand where tokens are consumed across distinct software engineering activities. We analyze execution traces from 30 software development tasks performed by the ChatDev framework using a GPT-5 reasoning model, mapping its internal phases to distinct development stages (Design, Coding, Code Completion, Code Review, Testing, and Documentation) to create a standardized evaluation framework. We then quantify and compare the distribution of tokens (input, output, reasoning) across these stages. Our preliminary findings show that the iterative Code Review stage dominates token consumption, accounting for an average of 59.4% of tokens. Furthermore, we observe that input tokens consistently constitute the largest share of consumption, averaging 53.9%, providing empirical evidence for potentially significant inefficiencies in agentic collaboration. Our results suggest that the primary cost of agentic software engineering lies not in initial code generation but in automated refinement and verification. Our methodology can help practitioners predict expenses and optimize workflows, and it directs future research toward developing more token-efficient agent collaboration protocols.
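The stage-level accounting the abstract describes — aggregating input, output, and reasoning tokens from execution traces into per-stage shares — can be sketched as follows. The event schema and field names here are illustrative assumptions for the sketch, not ChatDev's actual trace format.

```python
from collections import defaultdict

def stage_token_shares(trace_events):
    """Aggregate trace events into each SDLC stage's percentage share
    of total token consumption.

    Each event is assumed to be a dict with a 'stage' name and
    'input', 'output', 'reasoning' token counts (hypothetical fields).
    """
    totals = defaultdict(int)
    grand_total = 0
    for event in trace_events:
        used = event["input"] + event["output"] + event["reasoning"]
        totals[event["stage"]] += used
        grand_total += used
    # Convert absolute per-stage totals into percentage shares.
    return {stage: 100.0 * t / grand_total for stage, t in totals.items()}

# Toy trace: the iterative Code Review stage fires twice, Coding once.
events = [
    {"stage": "Coding", "input": 400, "output": 100, "reasoning": 0},
    {"stage": "Code Review", "input": 600, "output": 200, "reasoning": 200},
    {"stage": "Code Review", "input": 300, "output": 100, "reasoning": 100},
]
shares = stage_token_shares(events)
```

On this toy trace, Code Review accounts for 75% of the 2,000 tokens consumed, mirroring (in exaggerated form) the paper's finding that review dominates cost.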

Abstract (translated)

LLM-based Multi-Agent (LLM-MA) systems are seeing increasingly wide application in automating complex software engineering tasks such as requirements engineering, code generation, and testing. However, the operational efficiency and resource consumption of these systems remain insufficiently understood, making practical adoption difficult due to unpredictable costs and environmental impact. To address this challenge, we analyze the token consumption patterns of an LLM-MA system across the entire Software Development Life Cycle (SDLC), aiming to understand how tokens are used in distinct software engineering activities. Our study is based on data from 30 software development tasks executed by the ChatDev framework using a GPT-5 reasoning model. We map the system's internal phases to concrete development stages (Design, Coding, Code Completion, Code Review, Testing, and Documentation) to create a standardized evaluation framework. We then quantify and compare the distribution of tokens (input, output, reasoning) across these stages. Preliminary results show that the iterative Code Review stage consumes the most tokens, accounting for an average of 59.4%. In addition, we find that input tokens consistently make up the largest share, averaging 53.9%, providing empirical evidence of potentially significant inefficiencies. Our findings indicate that the primary cost of agentic software engineering lies not in initial code generation but in automated refinement and verification. The methodology we propose can help practitioners predict expenses and optimize workflows, and directs future research toward developing more token-efficient multi-agent collaboration protocols.

URL

https://arxiv.org/abs/2601.14470

PDF

https://arxiv.org/pdf/2601.14470.pdf

