Abstract
ChatGPT can generate grammatically flawless, seemingly human replies to questions from many domains. The number of its users and applications is growing at an unprecedented rate. Unfortunately, use and abuse come hand in hand. In this paper, we study whether a machine learning model can be effectively trained to accurately distinguish between original human text and seemingly human (that is, ChatGPT-generated) text, especially when the text is short. Furthermore, we employ an explainable artificial intelligence framework to gain insight into the reasoning of the model trained to differentiate between ChatGPT-generated and human-generated text. The goal is to analyze the model's decisions and determine whether any specific patterns or characteristics can be identified. Our study focuses on short online reviews and conducts two experiments comparing human-generated and ChatGPT-generated text. The first experiment involves ChatGPT text generated from custom queries, while the second involves text generated by rephrasing original human-written reviews. We fine-tune a Transformer-based model, use it to make predictions, and then explain those predictions with SHAP. We compare our model with a perplexity-score-based approach and find that disambiguating human from ChatGPT-generated reviews is more challenging for the ML model when the text is rephrased. Nevertheless, our proposed approach still achieves an accuracy of 79%. Using explainability, we observe that ChatGPT's writing is polite, lacks specific details, uses fancy and atypical vocabulary, is impersonal, and typically does not express feelings.
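The abstract mentions a perplexity-score baseline for detection. As background, the sketch below shows how perplexity is computed from a language model's per-token log-probabilities; the function name and the toy probability values are illustrative assumptions, not taken from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability.

    Lower perplexity means the text looks more predictable to the scoring
    language model, which detection baselines use as a signal that the
    text may be machine-generated.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical token probabilities: a predictable sequence vs. a surprising one.
predictable = [math.log(0.9)] * 10
surprising = [math.log(0.1)] * 10
assert perplexity(predictable) < perplexity(surprising)
```

If every token has probability 0.5, the perplexity is exactly 2, matching the intuition of a uniform two-way choice at each step.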
URL
https://arxiv.org/abs/2301.13852