Abstract
To detect the deployment of large language models for malicious use cases (e.g., fake content creation or academic plagiarism), several approaches have recently been proposed for identifying AI-generated text via watermarks or statistical irregularities. How robust are these detection algorithms to paraphrases of AI-generated text? To stress test these detectors, we first train an 11B parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, optionally leveraging surrounding text (e.g., user-written prompts) as context. DIPPER also uses scalar knobs to control the amount of lexical diversity and reordering in the paraphrases. Paraphrasing text generated by three large language models (including GPT3.5-davinci-003) with DIPPER successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI's text classifier. For example, DIPPER drops the detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings, while only classifying 1% of human-written sequences as AI-generated. We will open source our code, model and data for future research.
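The retrieval defense described above can be illustrated with a minimal sketch. This is not the paper's implementation (which uses a learned semantic retriever over a 15M-generation corpus); here a toy bag-of-words cosine similarity stands in for the retriever, and the `threshold` value is an illustrative assumption, not one reported in the paper.

```python
import math
from collections import Counter

def _vec(text):
    # Toy bag-of-words vector; a real deployment would use a
    # semantic embedding model over the API's generation database.
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def detect(candidate, generation_db, threshold=0.75):
    """Flag `candidate` as AI-generated if any previously generated
    sequence in `generation_db` matches it above `threshold`."""
    cv = _vec(candidate)
    best = max((_cosine(cv, _vec(g)) for g in generation_db), default=0.0)
    return best >= threshold, best

# A paraphrase of a stored generation still scores high, so it is
# caught even though token-level detectors would miss it.
db = ["the quick brown fox jumps over the lazy dog"]
flagged, score = detect("the quick brown fox leaps over the lazy dog", db)
```

Because similarity is computed against the provider's own stored outputs rather than against statistical properties of the text, moderate paraphrasing leaves the match largely intact, which is why this defense is more robust to DIPPER-style attacks than per-token detectors.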
URL
https://arxiv.org/abs/2303.13408