Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection

2025-04-16 09:25:54
Kabir Ahuja, Melanie Sclar, Yulia Tsvetkov

Abstract

Stories are a fundamental aspect of human experience. Engaging deeply with stories and spotting plot holes -- inconsistencies in a storyline that break the internal logic or rules of a story's world -- requires nuanced reasoning skills, including tracking entities and events and their interplay, abstract thinking, pragmatic narrative understanding, commonsense and social reasoning, and theory of mind. As Large Language Models (LLMs) increasingly generate, interpret, and modify text, rigorously assessing their narrative consistency and deeper language understanding becomes critical. However, existing benchmarks focus mainly on surface-level comprehension. In this work, we propose plot hole detection in stories as a proxy for evaluating language understanding and reasoning in LLMs. We introduce FlawedFictionsMaker, a novel algorithm to controllably and carefully synthesize plot holes in human-written stories. Using this algorithm, we construct FlawedFictions, a benchmark for evaluating LLMs' plot hole detection abilities in stories, which is robust to contamination and human-filtered to ensure high quality. We find that state-of-the-art LLMs struggle to solve FlawedFictions accurately regardless of the reasoning effort allowed, and that performance degrades significantly as story length increases. Finally, we show that LLM-based story summarization and story generation are prone to introducing plot holes, with plot hole detection rates increasing by more than 50% and 100%, respectively, relative to the human-written originals.
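To make the evaluation setup concrete, the sketch below shows one way a plot hole detection benchmark of this kind can be scored: each story carries a binary label ("flawed" if an inconsistency was synthesized into it, "consistent" otherwise), the model is asked for a one-word verdict, and accuracy is the fraction of labels recovered. This is only an illustrative Python sketch; the StoryExample dataclass, PROMPT_TEMPLATE, and the user-supplied query_llm callable are assumptions made for the example, not the paper's released code or prompts.

```python
# Illustrative scoring loop for a plot-hole-detection benchmark in the spirit of
# FlawedFictions: each example is a story paired with a binary label
# ("flawed" if a plot hole was synthesized into it, "consistent" otherwise).
# The model is asked for a one-word verdict and accuracy is reported.
# All names here are hypothetical; this is not the paper's actual code.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StoryExample:
    story: str   # full story text (possibly containing a synthesized plot hole)
    label: str   # "flawed" or "consistent"


PROMPT_TEMPLATE = (
    "Read the following story carefully and decide whether it contains a plot hole, "
    "i.e. an inconsistency that breaks the story's internal logic.\n\n"
    "Story:\n{story}\n\n"
    "Answer with exactly one word: 'flawed' or 'consistent'."
)


def evaluate_plot_hole_detection(
    examples: List[StoryExample],
    query_llm: Callable[[str], str],  # user-supplied: prompt -> model response
) -> float:
    """Return the fraction of stories whose flawed/consistent label the model recovers."""
    correct = 0
    for ex in examples:
        response = query_llm(PROMPT_TEMPLATE.format(story=ex.story)).strip().lower()
        # Take the verdict keyword if present; default to "consistent" otherwise.
        prediction = "flawed" if "flawed" in response else "consistent"
        correct += int(prediction == ex.label)
    return correct / len(examples) if examples else 0.0


if __name__ == "__main__":
    # Toy examples; a real evaluation would use the human-filtered benchmark stories.
    toy = [
        StoryExample("Mara locked the only key inside the vault, yet later "
                     "opens the vault with that same key.", "flawed"),
        StoryExample("Mara hid a spare key under the floorboard and uses it "
                     "to open the vault at the end.", "consistent"),
    ]

    def dummy_llm(prompt: str) -> str:
        # Trivial stand-in "model" that flags stories mentioning 'yet later'.
        return "flawed" if "yet later" in prompt else "consistent"

    print(f"accuracy: {evaluate_plot_hole_detection(toy, dummy_llm):.2f}")
```

With a real model client plugged in as query_llm, the same loop could be run separately on short and long stories to examine whether accuracy degrades with story length, as the abstract reports.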

Abstract (translated)

Stories are a fundamental aspect of human experience. Engaging deeply with stories and spotting plot holes, i.e., inconsistencies in a storyline that break the internal logic or the rules of the story's world, requires nuanced reasoning abilities, including tracking entities and events and their interplay, abstract thinking, pragmatic narrative understanding, commonsense and social reasoning, and theory of mind. As Large Language Models (LLMs) increasingly generate, interpret, and modify text, rigorously evaluating their narrative consistency and deeper language understanding becomes critical. However, existing benchmarks focus mainly on surface-level comprehension. In this paper, we propose plot hole detection as a proxy for measuring the language understanding and reasoning abilities of LLMs. We introduce a novel algorithm, FlawedFictionsMaker, which can controllably and carefully synthesize plot holes into human-written stories. Using this algorithm, we construct FlawedFictions, a benchmark for evaluating LLMs' ability to detect plot holes in stories; the benchmark is resistant to contamination, and human filtering ensures its quality. We find that state-of-the-art LLMs struggle to solve FlawedFictions accurately no matter how much reasoning effort is allowed, and that performance degrades significantly as story length increases. Finally, we show that LLM-based story summarization and story generation readily introduce plot holes, with plot hole detection rates increasing by more than 50% and 100%, respectively, compared with the human-written originals.

URL

https://arxiv.org/abs/2504.11900

PDF

https://arxiv.org/pdf/2504.11900.pdf
