Paper Reading AI Learner

MdEval: Massively Multilingual Code Debugging

2024-11-04 17:36:40
Shukai Liu, Linzheng Chai, Jian Yang, Jiajun Shi, He Zhu, Liran Wang, Ke Jin, Wei Zhang, Hualei Zhu, Shuyue Guo, Tao Sun, Jiaheng Liu, Yunlong Duan, Yu Hao, Liqun Yang, Guanglin Niu, Ge Zhang, Zhoujun Li

Abstract

Code large language models (LLMs) have made significant progress in code debugging by directly generating correct code from a buggy code snippet. Programming benchmarks, typically consisting of buggy code snippets and their associated test cases, are used to assess the debugging capabilities of LLMs. However, many existing benchmarks focus primarily on Python and are limited in language diversity (e.g., DebugBench and DebugEval). To advance multilingual debugging with LLMs, we propose the first massively multilingual debugging benchmark, which includes 3.6K test samples across 18 programming languages and covers the automated program repair (APR) task, the code review (CR) task, and the bug identification (BI) task. We further introduce the debugging instruction corpus MDEVAL-INSTRUCT, built by injecting bugs into correct multilingual queries and solutions (xDebugGen). On top of MDEVAL-INSTRUCT, we train a multilingual debugger, xDebugCoder, as a strong baseline designed to handle bugs across a wide range of programming languages (e.g., "Missing Mut" in Rust and "Misused Macro Definition" in C). Our extensive experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs (e.g., the GPT and Claude series), highlighting substantial room for improvement in multilingual code debugging scenarios.
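To make the bug-injection idea concrete, here is a toy sketch of how a (buggy, fixed) instruction pair for the APR task could be constructed from a correct solution. This is an illustrative assumption, not the paper's xDebugGen pipeline: the function names, the sample schema, and the single perturbation rule (weakening a strict comparison into an off-by-one bug) are all invented here for demonstration.

```python
# Toy sketch of bug injection for building APR-style instruction pairs.
# NOTE: illustrative only -- the perturbation rule and sample schema are
# assumptions, not the actual xDebugGen method from the paper.

def inject_bug(correct_code: str) -> str:
    """Weaken the first strict '<' comparison into '<=' (an off-by-one bug)."""
    return correct_code.replace(" < ", " <= ", 1)

def make_apr_sample(query: str, correct_code: str) -> dict:
    """Pair the injected buggy code with its original fix as one APR sample."""
    return {
        "task": "APR",
        "query": query,
        "buggy_code": inject_bug(correct_code),
        "fixed_code": correct_code,
    }

correct = (
    "def first_n(xs, n):\n"
    "    return [x for i, x in enumerate(xs) if i < n]"
)
sample = make_apr_sample("Return the first n elements of xs.", correct)
print(sample["buggy_code"])
```

A real pipeline would cover many bug categories per language (syntax, logic, reference, multi-type), but the pairing of an injected buggy snippet with its known-correct original is the core of turning correct code into debugging training data.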

URL

https://arxiv.org/abs/2411.02310

PDF

https://arxiv.org/pdf/2411.02310.pdf
