Paper Reading AI Learner

From Form to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency

2024-04-18 12:48:17
Xenia Ohmer, Elia Bruni, Dieuwke Hupkes

Abstract

The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what "understanding" means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes - inspired by Fregean senses - of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model's multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.
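The core evaluation idea, checking whether a model gives the same answer when the same question is posed in different "senses" (translations or paraphrases), can be sketched as a simple pairwise-agreement metric. This is only an illustrative sketch, not the paper's actual GPT-3.5 pipeline: `multisense_consistency` and the `toy_model` stub below are hypothetical names introduced here for illustration.

```python
# Minimal sketch of a multisense-consistency check. The real study
# queries GPT-3.5 across five languages and several NLU benchmarks;
# here a toy stub stands in for the model.

def multisense_consistency(model, sense_variants):
    """Fraction of variant pairs for which the model's answers agree.

    `sense_variants` holds the same question expressed in different
    "senses", e.g. translations or paraphrases of one underlying fact.
    """
    answers = [model(q) for q in sense_variants]
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    if not pairs:          # a single variant is trivially consistent
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def toy_model(question):
    """Hypothetical stub: answers a capital-city fact, but its 'understanding'
    is sense-dependent, mirroring the inconsistency the paper reports."""
    known = {
        "What is the capital of France?": "Paris",
        "Was ist die Hauptstadt von Frankreich?": "Paris",
        "Quelle est la capitale de la France ?": "Lyon",  # inconsistent sense
    }
    return known.get(question, "unknown")


variants = [
    "What is the capital of France?",
    "Was ist die Hauptstadt von Frankreich?",
    "Quelle est la capitale de la France ?",
]
score = multisense_consistency(toy_model, variants)  # 1 of 3 pairs agree
```

A fully consistent model would score 1.0 on every variant set; the paper's finding is that GPT-3.5's scores drop well below that once the same content is presented in a different language or phrasing.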


URL

https://arxiv.org/abs/2404.12145

PDF

https://arxiv.org/pdf/2404.12145.pdf

