Abstract
The rise of large language models (LLMs) has led to dramatic improvements across a wide range of natural language tasks. These advancements have extended into the domain of code, facilitating complex tasks such as code generation, translation, summarization, and repair. However, their utility for real-world deployment has only recently been studied, particularly on software engineering (SWE) tasks such as GitHub issue resolution. In this study, we examine the code reasoning techniques that underlie the ability to perform such tasks and the paradigms used to drive their performance. Our contributions in this paper are: (1) the first dedicated survey on code reasoning for code tasks, highlighting overarching strategies as well as hybrid and agentic approaches; (2) a taxonomy of the techniques used to drive code reasoning; (3) a comprehensive overview of performance on common benchmarks and a showcase of new, under-explored benchmarks with high potential in SWE; (4) an exploration of how core properties of code can be used to explain different reasoning techniques; and (5) a discussion of gaps and potentially under-explored areas for future research.
URL
https://arxiv.org/abs/2506.13932