Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

2024-10-03 17:57:22
Ulyana Piterbarg, Lerrel Pinto, Rob Fergus

Abstract

Software engineers mainly write code by editing existing programs. In contrast, large language models (LLMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of open-sourced edit data. While high-quality instruction data for code synthesis is already scarce, high-quality edit data is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors existing code into a sequence of code edits by using a linter to procedurally sample across the error-free insertions that can be used to sequentially write programs. It outputs edit sequences as text strings consisting of consecutive program diffs. To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we instruction finetune a series of smaller LLMs ranging from 2.6B to 14B parameters on both the re-factored and original versions of this dataset, comparing zero-shot performance on code synthesis benchmarks. We show that during repeated sampling, edit sequence finetuned models produce more diverse programs than baselines. This results in better inference-time scaling for benchmark coverage as a function of samples, i.e. the fraction of problems "pass@k" solved by any attempt given "k" tries. For example, on HumanEval pass@50, small LLMs finetuned on synthetic edit sequences are competitive with GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%) in absolute score. Finally, we also pretrain our own tiny LMs for code understanding. We show that finetuning tiny models on synthetic code edits results in state-of-the-art code synthesis for the on-device model class. Our 150M parameter edit sequence LM matches or outperforms code models with twice as many parameters, both with and without repeated sampling, including Codex and AlphaCode.
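
The refactoring step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes Python source, uses `ast.parse` as a stand-in for a real linter, and deletes random contiguous chunks of lines while walking backward from the full program to an empty one. Reversing that walk gives a sequence of error-free insertions, which `difflib.unified_diff` renders as consecutive diffs. The function names (`is_error_free`, `sample_program_states`, `to_edit_sequence`) and the `max_tries` parameter are hypothetical.

```python
import ast
import difflib
import random


def is_error_free(lines):
    """Stand-in for a linter (assumption): accept anything Python can parse."""
    try:
        ast.parse("\n".join(lines))
        return True
    except SyntaxError:
        return False


def sample_program_states(program, rng, max_tries=20):
    """Walk backward from the full program to the empty one, deleting a
    contiguous chunk of lines at each step so that every intermediate
    state still passes the check, then reverse the walk."""
    lines = program.splitlines()
    states = [lines]
    while lines:
        candidate = []  # fallback: delete everything that is left
        for _ in range(max_tries):
            start = rng.randrange(len(lines))
            end = rng.randrange(start, len(lines)) + 1
            trial = lines[:start] + lines[end:]
            if is_error_free(trial):
                candidate = trial
                break
        lines = candidate
        states.append(lines)
    return states[::-1]  # empty program first, full program last


def to_edit_sequence(states):
    """Render each consecutive pair of states as a unified diff string."""
    diffs = []
    for before, after in zip(states, states[1:]):
        diff = difflib.unified_diff(before, after, lineterm="", n=0)
        diffs.append("\n".join(diff))
    return diffs


if __name__ == "__main__":
    source = "def add(a, b):\n    total = a + b\n    return total\n\nprint(add(1, 2))"
    for step in to_edit_sequence(sample_program_states(source, random.Random(0))):
        print(step, end="\n\n")
```

Each printed diff contains only added lines, so concatenating the diffs yields one edit-sequence string for the program, in the spirit of the instruction + program-diff-sequence tuples the abstract mentions. A real linter would also catch errors that parsing alone misses, such as references to undefined names.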

URL

https://arxiv.org/abs/2410.02749

PDF

https://arxiv.org/pdf/2410.02749.pdf

