Paper Reading AI Learner

CodeRAG-Bench: Can Retrieval Augment Code Generation?

2024-06-20 16:59:52
Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, Daniel Fried

Abstract

While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts such as library documentation can facilitate generating accurate and functional code. Despite the success of retrieval-augmented generation (RAG) in various text-oriented tasks, its potential for improving code generation remains under-explored. In this work, we conduct a systematic, large-scale analysis by asking: in what scenarios can retrieval benefit code generation models, and what challenges remain? We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks: basic programming, open-domain, and repository-level problems. We aggregate documents from five sources for models to retrieve contexts: competition solutions, online tutorials, library documentation, StackOverflow posts, and GitHub repositories. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources. While notable gains are made in final code generation by retrieving high-quality contexts across various settings, our analysis reveals room for improvement -- current retrievers still struggle to fetch useful contexts, especially when lexical overlap is limited, and generators fail to improve when context lengths are limited or when they lack the ability to integrate additional contexts. We hope CodeRAG-Bench serves as an effective testbed to encourage further development of advanced code-oriented RAG methods.
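The retrieval-augmented pipeline the abstract describes -- retrieve relevant documents from a source such as library documentation, then prepend them to the generation prompt -- can be sketched minimally. The snippet below is an illustrative toy, not the paper's actual system: it uses a simple lexical-overlap retriever (the corpus, query, and scoring function are all hypothetical examples), whereas CodeRAG-Bench evaluates far stronger retrievers and generators.

```python
import re
from collections import Counter

def tokenize(text):
    # crude lowercase word tokenizer for lexical matching
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(query, doc):
    # count shared tokens (multiset intersection); lexical retrievers
    # like this fail exactly when query/document wording diverges,
    # the "limited lexical overlap" failure mode noted in the abstract
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum(min(q[t], d[t]) for t in q)

def retrieve(query, corpus, k=2):
    # rank candidate documents by overlap with the query, keep top-k
    return sorted(corpus, key=lambda doc: overlap_score(query, doc),
                  reverse=True)[:k]

def build_prompt(query, contexts):
    # prepend retrieved contexts to the code-generation prompt
    ctx = "\n\n".join(contexts)
    return f"# Retrieved context:\n{ctx}\n\n# Task:\n{query}\n"

# toy document collection standing in for library documentation
corpus = [
    "pandas.DataFrame.merge: merge DataFrame objects with a "
    "database-style join",
    "numpy.argsort: returns the indices that would sort an array",
    "requests.get: sends an HTTP GET request and returns a Response",
]
query = "merge two pandas DataFrame objects on a shared column"
contexts = retrieve(query, corpus, k=1)
prompt = build_prompt(query, contexts)
```

In a full system, `prompt` would be passed to a code LM; the benchmark's open questions concern both the retrieval step (fetching genuinely useful contexts) and the generation step (fitting and integrating them within the model's context window).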

URL

https://arxiv.org/abs/2406.14497

PDF

https://arxiv.org/pdf/2406.14497.pdf
