
Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach

2024-04-22 15:54:53
Yao Wan, Guanghua Wan, Shijie Zhang, Hongyu Zhang, Yulei Sui, Pan Zhou, Hai Jin, Lichao Sun

Abstract

Recent years have witnessed significant progress in deep learning-based models for automated code completion. Although using source code from GitHub has been common practice for training such models, it may raise legal and ethical issues, such as copyright infringement. In this paper, we investigate the legal and ethical issues of current neural code completion models by answering the following question: is my code used to train your neural code completion model? To this end, we tailor a membership inference approach (termed CodeMI), originally crafted for classification tasks, to the more challenging task of code completion. In particular, since the target code completion models behave as opaque black boxes, preventing access to their training data and parameters, we train multiple shadow models to mimic their behavior. The posteriors acquired from these shadow models are then used to train a membership classifier, which can in turn infer the membership status of a given code sample from the output of a target code completion model. We comprehensively evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models (i.e., LSTM-based, CodeGPT, CodeGen, and StarCoder). Experimental results reveal that the LSTM-based and CodeGPT models suffer from membership leakage, which our proposed membership inference approach detects with accuracies of 0.842 and 0.730, respectively. Interestingly, our experiments also show that the data membership of current large language models of code, e.g., CodeGen and StarCoder, is difficult to detect, leaving ample room for further improvement. Finally, we attempt to explain these findings from the perspective of model memorization.
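
The shadow-model pipeline described in the abstract follows the classic membership inference recipe adapted to generation: train shadow completion models on known member/non-member splits, summarize the per-token posteriors each shadow model assigns to a sample into a feature vector, and fit a binary membership classifier on those labeled features. Below is a minimal Python sketch of that recipe, not the paper's exact design: train_completion_model and token_posteriors are hypothetical stand-ins for training a shadow code model and querying a model's per-token probabilities of the ground-truth tokens, and the feature set is illustrative.

    # Minimal sketch of shadow-model membership inference for code completion.
    # train_completion_model(...) and token_posteriors(...) are hypothetical
    # stand-ins for shadow-model training and black-box posterior queries.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def features(posteriors):
        # Summarize the per-token probabilities of the ground-truth tokens
        # into a fixed-length vector: mean/min log-prob and perplexity.
        logp = np.log(np.asarray(posteriors) + 1e-12)
        return np.array([logp.mean(), logp.min(), np.exp(-logp.mean())])

    def train_attack(shadow_splits):
        # Each shadow split is (member_samples, nonmember_samples); a shadow
        # model is trained on the members, then queried on both sets so the
        # classifier sees labeled member/non-member posterior behavior.
        X, y = [], []
        for members, nonmembers in shadow_splits:
            shadow = train_completion_model(members)  # hypothetical
            for code in members:
                X.append(features(token_posteriors(shadow, code)))  # hypothetical
                y.append(1)
            for code in nonmembers:
                X.append(features(token_posteriors(shadow, code)))
                y.append(0)
        return LogisticRegression().fit(np.array(X), np.array(y))

    # At attack time, only black-box access to the target model is needed:
    # attack = train_attack(shadow_splits)
    # is_member = attack.predict([features(token_posteriors(target, my_code))])

Note that in the paper's black-box setting only the target model's output probabilities are consulted at attack time; the shadow models exist solely to generate labeled training data for the membership classifier.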

URL

https://arxiv.org/abs/2404.14296

PDF

https://arxiv.org/pdf/2404.14296.pdf

