
Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach

2024-04-22 15:54:53
Yao Wan, Guanghua Wan, Shijie Zhang, Hongyu Zhang, Yulei Sui, Pan Zhou, Hai Jin, Lichao Sun

Abstract

Recent years have witnessed significant progress in deep learning-based models for automated code completion. Although using source code from GitHub has been common practice for training such models, it may raise legal and ethical issues, such as copyright infringement. In this paper, we investigate the legal and ethical issues of current neural code completion models by answering the following question: is my code used to train your neural code completion model? To this end, we tailor a membership inference approach (termed CodeMI), originally crafted for classification tasks, to the more challenging task of code completion. In particular, since the target code completion models behave as opaque black boxes, preventing access to their training data and parameters, we train multiple shadow models to mimic their behavior. The posteriors acquired from these shadow models are then used to train a membership classifier, which can in turn infer the membership status of a given code sample from the output of a target code completion model. We comprehensively evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models (i.e., LSTM-based, CodeGPT, CodeGen, and StarCoder). Experimental results reveal that the LSTM-based and CodeGPT models suffer from membership leakage, which our proposed membership inference approach detects with accuracies of 0.842 and 0.730, respectively. Interestingly, our experiments also show that the data membership of current large language models of code, e.g., CodeGen and StarCoder, is difficult to detect, leaving ample room for further improvement. Finally, we attempt to explain these findings from the perspective of model memorization.
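
The shadow-model pipeline described in the abstract follows the classic membership inference recipe adapted to generation: train shadow completion models on known member/non-member splits, summarize the per-token posteriors each shadow model assigns to a sample into a feature vector, and fit a binary membership classifier on those labeled features. Below is a minimal Python sketch of that recipe, not the paper's exact design: train_completion_model and token_posteriors are hypothetical stand-ins for training a shadow code model and querying a model's per-token probabilities of the ground-truth tokens, and the feature set is illustrative.

    # Minimal sketch of shadow-model membership inference for code completion.
    # train_completion_model(...) and token_posteriors(...) are hypothetical
    # stand-ins for shadow-model training and black-box posterior queries.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def features(posteriors):
        # Summarize the per-token probabilities of the ground-truth tokens
        # into a fixed-length vector: mean/min log-prob and perplexity.
        logp = np.log(np.asarray(posteriors) + 1e-12)
        return np.array([logp.mean(), logp.min(), np.exp(-logp.mean())])

    def train_attack(shadow_splits):
        # Each shadow split is (member_samples, nonmember_samples); a shadow
        # model is trained on the members, then queried on both sets so the
        # classifier sees labeled member/non-member posterior behavior.
        X, y = [], []
        for members, nonmembers in shadow_splits:
            shadow = train_completion_model(members)  # hypothetical
            for code in members:
                X.append(features(token_posteriors(shadow, code)))  # hypothetical
                y.append(1)
            for code in nonmembers:
                X.append(features(token_posteriors(shadow, code)))
                y.append(0)
        return LogisticRegression().fit(np.array(X), np.array(y))

    # At attack time, only black-box access to the target model is needed:
    # attack = train_attack(shadow_splits)
    # is_member = attack.predict([features(token_posteriors(target, my_code))])

Note that in the paper's black-box setting only the target model's output probabilities are consulted at attack time; the shadow models exist solely to generate labeled training data for the membership classifier.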

URL

https://arxiv.org/abs/2404.14296

PDF

https://arxiv.org/pdf/2404.14296.pdf

