Abstract
We systematically study the capacity of two large language models for code, CodeT5 and Codex, to generalize to out-of-domain data. In this study, we consider two fundamental applications: code summarization and code generation. We split data into domains along its natural boundaries: by organization, by project, and by module within a software project. This makes recognizing in-domain versus out-of-domain data at deployment time trivial. We establish that samples from each new domain present both models with a significant distribution-shift challenge. We study how well established adaptation methods help the models generalize to new domains. Our experiments show that while multitask learning alone is a reasonable baseline, combining it with few-shot finetuning on examples retrieved from the training data achieves very strong performance; in fact, this combination can outperform direct finetuning in very low-data scenarios. Finally, we consider variations of this approach to create a more broadly applicable method for adapting to multiple domains at once. We find that, for code generation, a model adapted to multiple domains simultaneously performs on par with models adapted to each domain individually.
URL
https://arxiv.org/abs/2303.09128