AI Coders Are Among Us: Rethinking Programming Language Grammar Towards Efficient Code Generation

2024-04-25 04:46:02
Zhensu Sun, Xiaoning Du, Zhou Yang, Li Li, David Lo

Abstract

Besides humans and machines, Artificial Intelligence (AI) models have emerged to be another important audience of programming languages, as we come to the era of large language models (LLMs). LLMs can now excel at coding competitions and even program like developers to address various tasks, such as math calculation. Yet, the grammar and layout of existing programs are designed for humans. Particularly, abundant grammar tokens and formatting tokens are included to make the code more readable to humans. While beneficial, such a human-centric design imposes an unnecessary computational burden on LLMs where each token, either consumed or generated, consumes computational resources. To improve inference efficiency and reduce computational costs, we propose the concept of AI-oriented grammar, which aims to represent the code in a way that better suits the working mechanism of AI models. Code written with AI-oriented grammar discards formats and uses a minimum number of tokens to convey code semantics effectively. To demonstrate the feasibility of this concept, we explore and implement the first AI-oriented grammar for Python, named Simple Python (SimPy). SimPy is crafted by revising the original Python grammar through a series of heuristic rules. Programs written in SimPy maintain identical Abstract Syntax Tree (AST) structures to those in standard Python, allowing execution via a modified AST parser. In addition, we explore methods to enable existing LLMs to proficiently understand and use SimPy, and ensure the changes remain imperceptible for human developers. Compared with the original Python, SimPy not only reduces token usage by 13.5% and 10.4% for CodeLlama and GPT-4, but can also achieve equivalent, even improved, performance over the models trained on Python code.
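The claim that formatting can be dropped without changing a program's AST can be illustrated with standard Python alone. The sketch below is not SimPy and does not use SimPy's grammar. It only compares two spellings of the same ordinary Python function, checks that they parse to the same AST, and counts layout tokens with Python's own lexer. The names `READABLE`, `COMPACT`, `same_ast`, and `count_lexer_tokens` are hypothetical, and Python lexer tokens are not the subword tokens an LLM consumes, so the counts are indicative only.

```python
import ast
import io
import tokenize

# Two spellings of the same function: one laid out for human readers,
# one compacted. Both are ordinary Python, NOT SimPy syntax.
READABLE = '''
def add(a, b):
    # Sum two numbers.
    total = a + b

    return total
'''

COMPACT = "def add(a,b):\n total=a+b\n return total\n"

# Token types from Python's lexer that carry layout rather than meaning.
FORMAT_TOKENS = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.COMMENT,
}


def same_ast(src_a, src_b):
    """Return True when both sources parse to structurally identical ASTs."""
    # ast.dump omits line and column attributes by default, so only the
    # tree structure and node contents are compared.
    return ast.dump(ast.parse(src_a)) == ast.dump(ast.parse(src_b))


def count_lexer_tokens(src):
    """Return (formatting_tokens, other_tokens) as seen by the Python lexer."""
    fmt = other = 0
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.ENDMARKER:
            continue
        if tok.type in FORMAT_TOKENS:
            fmt += 1
        else:
            other += 1
    return fmt, other


if __name__ == "__main__":
    print("identical AST:", same_ast(READABLE, COMPACT))
    print("readable (formatting, other):", count_lexer_tokens(READABLE))
    print("compact  (formatting, other):", count_lexer_tokens(COMPACT))
```

Both versions carry the same non-layout tokens and differ only in layout tokens. That difference is the kind of overhead the abstract describes; measuring the savings reported for CodeLlama or GPT-4 would require those models' own tokenizers.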

URL

https://arxiv.org/abs/2404.16333

PDF

https://arxiv.org/pdf/2404.16333.pdf

