Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense

2023-03-23 16:29:27
Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, Mohit Iyyer

Abstract

To detect the deployment of large language models for malicious use cases (e.g., fake content creation or academic plagiarism), several approaches have recently been proposed for identifying AI-generated text via watermarks or statistical irregularities. How robust are these detection algorithms to paraphrases of AI-generated text? To stress test these detectors, we first train an 11B parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, optionally leveraging surrounding text (e.g., user-written prompts) as context. DIPPER also uses scalar knobs to control the amount of lexical diversity and reordering in the paraphrases. Paraphrasing text generated by three large language models (including GPT3.5-davinci-003) with DIPPER successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI's text classifier. For example, DIPPER drops the detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings, while only classifying 1% of human-written sequences as AI-generated. We will open source our code, model and data for future research.
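
The retrieval defense described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which retriever or threshold they use, so this example swaps in a toy bag-of-words cosine similarity as a stand-in for a real semantic encoder. All names (`embed`, `GenerationStore`, `threshold_for_fpr`, `is_ai_generated`) are hypothetical. The threshold helper shows one simple way to fix the false positive rate using similarity scores of known human-written texts.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a semantic encoder (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class GenerationStore:
    """Hypothetical database of sequences previously generated by the API."""

    def __init__(self):
        self._entries = []

    def add(self, text):
        # Store the text together with its precomputed vector.
        self._entries.append((text, embed(text)))

    def best_match(self, candidate):
        """Return (score, text) of the most similar stored generation."""
        query = embed(candidate)
        best_score, best_text = 0.0, None
        for text, vec in self._entries:
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_text = score, text
        return best_score, best_text


def threshold_for_fpr(human_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of the given human-written
    scores lie strictly above it (a text is flagged when score > threshold)."""
    if not human_scores:
        raise ValueError("need at least one human-written score")
    ordered = sorted(human_scores, reverse=True)
    allowed = min(int(target_fpr * len(ordered)), len(ordered) - 1)
    return ordered[allowed]


def is_ai_generated(store, candidate, threshold):
    """Flag the candidate if some stored generation is similar enough."""
    score, _ = store.best_match(candidate)
    return score > threshold
```

In use, the API provider would call `store.add(...)` for every generation it serves, calibrate `threshold` once on a held-out set of human-written text, and then call `is_ai_generated(store, candidate, threshold)` for each query. A linear scan is only workable for a toy store; a database at the scale mentioned in the abstract (15M generations) would need an indexed nearest-neighbor search instead.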

URL

https://arxiv.org/abs/2303.13408

PDF

https://arxiv.org/pdf/2303.13408.pdf

