Abstract
Besides humans and machines, Artificial Intelligence (AI) models have emerged as another important audience of programming languages as we enter the era of large language models (LLMs). LLMs can now excel at coding competitions and even program like developers to address various tasks, such as mathematical calculation. Yet the grammar and layout of existing programs are designed for humans. In particular, abundant grammar tokens and formatting tokens are included to make code more readable to humans. While beneficial for human readers, such a human-centric design imposes an unnecessary computational burden on LLMs, because every token, whether consumed or generated, costs computational resources. To improve inference efficiency and reduce computational costs, we propose the concept of AI-oriented grammar, which aims to represent code in a way that better suits the working mechanism of AI models. Code written with AI-oriented grammar discards formatting and uses a minimum number of tokens to convey code semantics effectively. To demonstrate the feasibility of this concept, we explore and implement the first AI-oriented grammar for Python, named Simple Python (SimPy). SimPy is crafted by revising the original Python grammar through a series of heuristic rules. Programs written in SimPy maintain Abstract Syntax Tree (AST) structures identical to those of standard Python, allowing execution via a modified AST parser. In addition, we explore methods that enable existing LLMs to proficiently understand and use SimPy, and that ensure the changes remain imperceptible to human developers. Compared with the original Python, SimPy not only reduces token usage by 13.5% and 10.4% for CodeLlama and GPT-4, respectively, but also achieves equivalent or even improved performance compared with models trained on Python code.
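The premise that formatting costs tokens without changing the AST can be illustrated with a minimal sketch using only the Python standard library. This is not SimPy and does not reproduce the paper's measurements: it counts tokens from Python's own lexer (`tokenize`), not from an LLM tokenizer such as the ones used by CodeLlama or GPT-4, and the helper names `count_tokens`, `same_ast`, and `LAYOUT_TYPES` are hypothetical, chosen for illustration only.

```python
import ast
import io
import tokenize

# Lexer token types treated here as "layout only" (an illustrative choice,
# not the paper's definition of formatting tokens).
LAYOUT_TYPES = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.COMMENT,
}


def count_tokens(source: str) -> tuple[int, int]:
    """Return (total, layout) counts of Python lexer tokens in `source`."""
    total = layout = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.ENDMARKER:
            continue  # end-of-file marker carries no content
        total += 1
        if tok.type in LAYOUT_TYPES:
            layout += 1
    return total, layout


def same_ast(a: str, b: str) -> bool:
    """True if both sources parse to the same AST (positions ignored)."""
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


# The same function, once formatted for humans and once written compactly.
formatted = (
    "def add(a, b):\n"
    "    # sum two numbers\n"
    "    total = a + b\n"
    "\n"
    "    return total\n"
)
compact = "def add(a,b):\n total=a+b\n return total\n"

formatted_total, formatted_layout = count_tokens(formatted)
compact_total, compact_layout = count_tokens(compact)
```

Here `same_ast(formatted, compact)` holds while the formatted version carries more layout tokens, which is the kind of redundancy an AI-oriented grammar targets. Actual savings depend on the model's tokenizer and on the grammar rules, neither of which this sketch models.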
URL
https://arxiv.org/abs/2404.16333