Abstract
Code large language models (LLMs) have made significant progress in code debugging by directly generating correct code from a buggy code snippet. Programming benchmarks, typically consisting of buggy code snippets and their associated test cases, are used to assess the debugging capabilities of LLMs. However, many existing benchmarks focus primarily on Python and are limited in language diversity (e.g., DebugBench and DebugEval). To advance the field of multilingual debugging with LLMs, we propose the first massively multilingual debugging benchmark, which includes 3.6K test samples across 18 programming languages and covers the automated program repair (APR) task, the code review (CR) task, and the bug identification (BI) task. Further, we introduce the debugging instruction corpus MDEVAL-INSTRUCT by injecting bugs into correct multilingual queries and solutions (xDebugGen). We then train a multilingual debugger, xDebugCoder, on MDEVAL-INSTRUCT as a strong baseline, designed specifically to handle bugs across a wide range of programming languages (e.g., "Missing Mut" in Rust and "Misused Macro Definition" in C). Our extensive experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs (e.g., the GPT and Claude series), highlighting considerable room for improvement in multilingual code debugging scenarios.
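For illustration, below is a minimal sketch of what a "Missing Mut" sample in the style of the APR task might look like in Rust. This is a hypothetical example, not drawn from MDEVAL itself: the buggy version declares the accumulator without `mut`, so the reassignment fails to compile; the repaired version adds the qualifier and is checked by an associated test case.

```rust
// Hypothetical "Missing Mut" repair example (illustrative, not from the benchmark).
// Buggy version declared `let total = 0;` without `mut`, so the
// `total += x` update fails to compile because `total` is immutable.
fn sum(xs: &[i32]) -> i32 {
    let mut total = 0; // fix: add `mut` so the accumulator can be reassigned
    for x in xs {
        total += x;
    }
    total
}

fn main() {
    // An associated test case of the kind a debugging benchmark would use
    // to verify that the generated repair is functionally correct.
    assert_eq!(sum(&[1, 2, 3]), 6);
}
```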
URL
https://arxiv.org/abs/2411.02310