Abstract
Recent Large Language Models (LLMs) have shown the ability to generate content that is difficult or impossible to distinguish from human writing. We investigate the ability of differently sized LLMs to replicate human writing style in short, creative texts in the domain of Showerthoughts, thoughts that may occur during mundane activities. We compare GPT-2 and GPT-Neo, both fine-tuned on Reddit data, and GPT-3.5 invoked in a zero-shot manner against human-authored texts. We measure human preference on the texts across specific dimensions that account for the quality of creative, witty texts. Additionally, we compare the ability of humans versus fine-tuned RoBERTa classifiers to detect AI-generated texts. We conclude that human evaluators rate the generated texts slightly worse on average regarding their creative quality, but they are unable to reliably distinguish between human-written and AI-generated texts. We further provide a dataset for creative, witty text generation based on Reddit Showerthoughts posts.
URL
https://arxiv.org/abs/2405.01660