Abstract
Stories are a fundamental aspect of human experience. Engaging deeply with stories and spotting plot holes -- inconsistencies in a storyline that break the internal logic or rules of a story's world -- requires nuanced reasoning skills, including tracking entities and events and their interplay, abstract thinking, pragmatic narrative understanding, commonsense and social reasoning, and theory of mind. As Large Language Models (LLMs) increasingly generate, interpret, and modify text, rigorously assessing their narrative consistency and deeper language understanding becomes critical. However, existing benchmarks focus mainly on surface-level comprehension. In this work, we propose plot hole detection in stories as a proxy for evaluating language understanding and reasoning in LLMs. We introduce FlawedFictionsMaker, a novel algorithm that controllably and carefully synthesizes plot holes in human-written stories. Using this algorithm, we construct FlawedFictions, a benchmark for evaluating LLMs' plot hole detection abilities in stories that is robust to contamination, with human filtering ensuring high quality. We find that state-of-the-art LLMs struggle to solve FlawedFictions accurately regardless of the reasoning effort allowed, with performance degrading significantly as story length increases. Finally, we show that LLM-based story summarization and story generation are prone to introducing plot holes, with plot hole detection rates increasing by more than 50% and 100%, respectively, relative to the human-written originals.
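To make the evaluation setup concrete, below is a minimal sketch (not the authors' code, and the paper does not specify its prompt or harness) of how an LLM could be probed for plot hole detection and scored against gold labels. The prompt wording, the query_llm callable, and the toy model are illustrative assumptions.

    # Hypothetical plot hole detection harness; query_llm stands in for any
    # chat-model API call (prompt string in, reply string out).
    from typing import Callable, List, Tuple

    PROMPT = (
        "Read the story below. A plot hole is an inconsistency that breaks "
        "the story's internal logic or world rules. Answer YES if the story "
        "contains a plot hole, otherwise NO.\n\nStory:\n{story}\n\nAnswer:"
    )

    def detect_plot_hole(story: str, query_llm: Callable[[str], str]) -> bool:
        """Return True if the model flags the story as containing a plot hole."""
        reply = query_llm(PROMPT.format(story=story))
        return reply.strip().upper().startswith("YES")

    def accuracy(pairs: List[Tuple[str, bool]],
                 query_llm: Callable[[str], str]) -> float:
        """Fraction of (story, has_plot_hole) pairs labeled correctly."""
        correct = sum(detect_plot_hole(s, query_llm) == label
                      for s, label in pairs)
        return correct / len(pairs)

    if __name__ == "__main__":
        # Toy stand-in model: flags stories containing an obvious contradiction.
        toy_llm = lambda p: "YES" if "never left the house" in p else "NO"
        data = [
            ("Ana locked up, never left the house, yet greeted Bo downtown.", True),
            ("Ana locked up, walked downtown, and greeted Bo.", False),
        ]
        print(f"toy accuracy: {accuracy(data, toy_llm):.2f}")

In the paper's actual setup, the positive examples would be FlawedFictionsMaker-corrupted versions of human-written stories and the negatives their unmodified originals, so accuracy measures whether the model detects the synthesized inconsistency without over-flagging clean stories.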
URL
https://arxiv.org/abs/2504.11900