Abstract
The fluency and factual knowledge of large language models (LLMs) heighten the need for corresponding systems to detect whether a piece of text is machine-written. For example, students may use LLMs to complete written assignments, leaving instructors unable to accurately assess student learning. In this paper, we first demonstrate that text sampled from an LLM tends to occupy negative curvature regions of the model's log probability function. Leveraging this observation, we then define a new curvature-based criterion for judging whether a passage was generated by a given LLM. This approach, which we call DetectGPT, does not require training a separate classifier, collecting a dataset of real or generated passages, or explicitly watermarking generated text. It uses only log probabilities computed by the model of interest and random perturbations of the passage produced by another generic pre-trained language model (e.g., T5). We find that DetectGPT is more discriminative than existing zero-shot methods for model sample detection, notably improving detection of fake news articles generated by the 20B-parameter GPT-NeoX from 0.81 AUROC for the strongest zero-shot baseline to 0.95 AUROC for DetectGPT. See this https URL for code, data, and other project information.
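The curvature-based criterion can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the sketch assumes the criterion compares the log probability of a passage with the average log probability of randomly perturbed copies of it: a passage sitting at a local peak of the log probability function (negative curvature) scores well above its perturbations. The names `perturbation_discrepancy`, `toy_log_prob`, `toy_perturb`, and `TOY_LOG_PROBS` are hypothetical. In practice the log probabilities would come from the model of interest and the perturbations from a generic pre-trained model such as T5; here both are replaced by toy stand-ins so the example runs on its own.

```python
import random
import statistics
from typing import Callable

# Hypothetical toy "model": a fixed per-word log probability table.
# A real setup would query the LLM under test instead.
TOY_LOG_PROBS = {"the": -1.0, "cat": -2.0, "sat": -3.0, "zebra": -6.0}


def toy_log_prob(text: str) -> float:
    """Sum of per-word log probabilities under the toy table."""
    return sum(TOY_LOG_PROBS[word] for word in text.split())


def toy_perturb(text: str, rng: random.Random) -> str:
    """Replace one randomly chosen word with a different vocabulary word.

    Stand-in for the perturbations produced by a generic pre-trained
    language model (e.g., T5) mentioned in the abstract.
    """
    words = text.split()
    i = rng.randrange(len(words))
    choices = [w for w in TOY_LOG_PROBS if w != words[i]]
    words[i] = rng.choice(choices)
    return " ".join(words)


def perturbation_discrepancy(
    text: str,
    log_prob: Callable[[str], float],
    perturb: Callable[[str, random.Random], str],
    n_perturbations: int = 20,
    seed: int = 0,
) -> float:
    """log p(text) minus the mean log p of randomly perturbed copies.

    Assumed form of the criterion: a clearly positive value suggests the
    text lies near a local maximum of the model's log probability.
    """
    rng = random.Random(seed)
    perturbed_scores = [
        log_prob(perturb(text, rng)) for _ in range(n_perturbations)
    ]
    return log_prob(text) - statistics.fmean(perturbed_scores)


# A passage made of the toy model's most likely word sits at a peak,
# so every perturbation lowers its log probability.
peak_score = perturbation_discrepancy("the the the", toy_log_prob, toy_perturb)

# A passage made of the least likely word sits in a trough,
# so every perturbation raises its log probability.
trough_score = perturbation_discrepancy(
    "zebra zebra zebra", toy_log_prob, toy_perturb
)

print(peak_score > 0, trough_score < 0)
```

A detector built on this score would flag a passage as likely model-generated when the discrepancy exceeds some threshold. The threshold is a tuning choice that is not specified in the abstract, which is why the sketch stops at computing the score.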
URL
https://arxiv.org/abs/2301.11305