Abstract
LLM-based Multi-Agent (LLM-MA) systems are increasingly applied to automate complex software engineering tasks such as requirements engineering, code generation, and testing. However, their operational efficiency and resource consumption remain poorly understood, and the resulting unpredictability in cost and environmental impact hinders practical adoption. To address this, we analyze token consumption patterns in an LLM-MA system across the Software Development Life Cycle (SDLC), aiming to understand where tokens are consumed across distinct software engineering activities. We analyze execution traces from 30 software development tasks performed by the ChatDev framework using a GPT-5 reasoning model, mapping its internal phases to distinct development stages (Design, Coding, Code Completion, Code Review, Testing, and Documentation) to create a standardized evaluation framework. We then quantify and compare the token distribution (input, output, reasoning) across these stages. Our preliminary findings show that the iterative Code Review stage accounts for the majority of token consumption, averaging 59.4% of tokens. Furthermore, we observe that input tokens consistently constitute the largest share of consumption, averaging 53.9%, which provides empirical evidence of potentially significant inefficiencies in agentic collaboration. Our results suggest that the primary cost of agentic software engineering lies not in initial code generation but in automated refinement and verification. Our novel methodology can help practitioners predict expenses and optimize workflows, and it directs future research toward developing more token-efficient agent collaboration protocols.
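The per-stage and per-type token shares described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record layout, the `token_shares` function, and the toy numbers are all hypothetical, chosen only to show the arithmetic. The sketch pools tokens across all records before computing shares; the paper reports averages, which may instead be means of per-task shares.

```python
from collections import defaultdict

# Development stages named in the abstract.
STAGES = ["Design", "Coding", "Code Completion",
          "Code Review", "Testing", "Documentation"]
TOKEN_TYPES = ("input", "output", "reasoning")


def token_shares(records):
    """Return (share per stage, share per token type) over pooled tokens.

    `records` is a hypothetical trace format: one dict per LLM call with a
    "stage" key and integer counts for "input", "output", and "reasoning".
    """
    per_stage = defaultdict(int)
    per_type = defaultdict(int)
    for rec in records:
        for kind in TOKEN_TYPES:
            per_stage[rec["stage"]] += rec[kind]
            per_type[kind] += rec[kind]
    total = sum(per_type.values())
    if total == 0:
        raise ValueError("no tokens recorded")
    stage_share = {s: per_stage[s] / total for s in STAGES}
    type_share = {k: per_type[k] / total for k in TOKEN_TYPES}
    return stage_share, type_share


# Toy, made-up counts for illustration only (not data from the paper).
toy_records = [
    {"stage": "Design", "input": 100, "output": 50, "reasoning": 50},
    {"stage": "Code Review", "input": 400, "output": 100, "reasoning": 100},
    {"stage": "Testing", "input": 100, "output": 50, "reasoning": 50},
]

stage_share, type_share = token_shares(toy_records)
for stage in STAGES:
    print(f"{stage:16s} {stage_share[stage]:.1%}")
for kind in TOKEN_TYPES:
    print(f"{kind:16s} {type_share[kind]:.1%}")
```

Stages with no recorded calls get a share of zero, so the output always lists all six stages and remains comparable across tasks.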
URL
https://arxiv.org/abs/2601.14470