
Do Large Language Models Understand Conversational Implicature -- A Case Study with a Chinese Sitcom

2024-04-30 12:43:53
Shisen Yue, Siyuan Song, Xinyuan Cheng, Hai Hu

Abstract

Understanding the non-literal meaning of an utterance is critical for large language models (LLMs) to become human-like social communicators. In this work, we introduce SwordsmanImp, the first Chinese multi-turn-dialogue-based dataset aimed at conversational implicature, sourced from dialogues in the Chinese sitcom My Own Swordsman. It includes 200 carefully handcrafted questions, each annotated with the Gricean maxims that are violated. We test eight closed-source and open-source LLMs on two tasks: a multiple-choice question task and an implicature explanation task. Our results show that GPT-4 attains human-level accuracy (94%) on the multiple-choice questions, followed by CausalLM at 78.5%. Other models, including GPT-3.5 and several open-source models, achieve lower accuracies ranging from 20% to 60%. Human raters were asked to score the LLM-generated explanations of the implicatures for their reasonability, logic, and fluency. While all models generate largely fluent and self-consistent text, their explanations score low on reasonability, with the exception of GPT-4, suggesting that most LLMs cannot produce satisfactory explanations of the implicatures in a conversation. Moreover, we find that LLM performance does not vary significantly across Gricean maxims, suggesting that LLMs do not process implicatures derived from different maxims differently. Our data and code are available at this https URL.
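To make the multiple-choice task concrete, below is a minimal sketch of how a model could be queried on a single SwordsmanImp-style item. The item schema, the example dialogue, and the prompt wording are illustrative assumptions (the abstract does not specify the dataset format); the sketch uses the official OpenAI Python SDK as the model interface, not the authors' actual evaluation code.

```python
# Hypothetical sketch of the multiple-choice implicature evaluation.
# Field names ("dialogue", "options", "label") and the prompt are
# assumptions for illustration, not the paper's actual protocol.
import re
from openai import OpenAI  # official OpenAI Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One illustrative item: a multi-turn dialogue whose final utterance
# carries a conversational implicature, plus four candidate readings.
item = {
    "dialogue": [
        "A: 你今晚来参加聚会吗? (Are you coming to the party tonight?)",
        "B: 我明天有考试。 (I have an exam tomorrow.)",
    ],
    "options": {
        "A": "B is probably not coming to the party.",  # implicature reading
        "B": "B is announcing the exam schedule.",      # literal distractor
        "C": "B wants A to take the exam too.",
        "D": "B is asking A for help with studying.",
    },
    "label": "A",
}

def ask_model(item, model="gpt-4"):
    """Ask the model to choose the best interpretation of the last utterance."""
    prompt = (
        "Read the dialogue and choose the best interpretation of the last "
        "speaker's utterance. Answer with a single letter (A/B/C/D).\n\n"
        + "\n".join(item["dialogue"])
        + "\n\n"
        + "\n".join(f"{k}. {v}" for k, v in item["options"].items())
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = resp.choices[0].message.content
    match = re.search(r"\b([ABCD])\b", text)  # extract the letter choice
    return match.group(1) if match else None

# Accuracy on the dataset would be the fraction of items where this holds.
print(ask_model(item) == item["label"])
```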

Abstract (translated)

Understanding the non-literal meaning of an utterance is critical for large language models (LLMs) to become human-like social communicators. In this work, we introduce SwordsmanImp, the first Chinese multi-turn-dialogue dataset targeting conversational implicature, sourced from the Chinese sitcom My Own Swordsman. It contains 200 carefully crafted questions, each annotated with the Gricean maxims that are violated. We test eight closed-source and open-source LLMs on two tasks: a multiple-choice question task and an implicature explanation task. Our results show that GPT-4 reaches human-level accuracy (94%) on the multiple-choice questions, followed by CausalLM at 78.5%. Other models, including GPT-3.5 and several open-source models, achieve lower accuracies ranging from 20% to 60%. Human raters were asked to score the LLM-generated explanations of the implicatures for reasonability, logic, and fluency. While all models produce largely fluent and self-consistent text, their explanations score low on reasonability, except for GPT-4, suggesting that most LLMs cannot produce satisfactory explanations of conversational implicatures. Moreover, we find that LLM performance does not vary significantly across Gricean maxims, suggesting that LLMs do not process implicatures derived from different maxims differently. Our data and code are available at the linked URL.

URL

https://arxiv.org/abs/2404.19509

PDF

https://arxiv.org/pdf/2404.19509.pdf

