Paper Reading AI Learner

Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans

2024-04-23 10:09:46
Vittoria Dentella, Fritz Guenther, Evelina Leivada

Abstract

Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans; however, it remains to be determined whether such differences are sensitive to model size. This work investigates the critical role of model scaling, determining whether increases in size compensate for the differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N=1,200 judgments are collected and scored for accuracy, stability, and improvement in accuracy upon repeated presentation of a prompt. The results of the best-performing LLM, ChatGPT-4, are compared with the results of n=80 humans on the same stimuli. We find that increased model size may lead to better performance, but LLMs are still not sensitive to (un)grammaticality in the way humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.
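The abstract's scoring protocol (accuracy, stability, and improvement in accuracy across repeated presentations of the same prompt) can be sketched in a few lines. The snippet below is a minimal illustration, assuming a hypothetical record of repeated yes/no judgments per prompt; the data layout, the particular stability and improvement definitions, and all names are assumptions for illustration, not the authors' actual pipeline.

```python
from statistics import mean

# Hypothetical records: for each prompt, the gold label ("yes" = grammatical)
# and the model's answers across k repeated presentations of that prompt.
judgments = {
    "prompt_01": {"gold": "yes", "answers": ["yes", "yes", "no", "yes"]},
    "prompt_02": {"gold": "no",  "answers": ["yes", "yes", "yes", "yes"]},
}

def accuracy(gold, answers):
    # Fraction of repetitions on which the model matched the gold label.
    return mean(a == gold for a in answers)

def stability(answers):
    # Share of the majority answer: 1.0 means the model gave the same
    # answer on every repetition (one simple way to quantify consistency).
    return max(answers.count(a) for a in set(answers)) / len(answers)

def improvement(gold, answers):
    # Correctness on the last repetition minus correctness on the first;
    # positive values mean the model improved with repeated exposure.
    return (answers[-1] == gold) - (answers[0] == gold)

for prompt, rec in judgments.items():
    print(prompt,
          f"acc={accuracy(rec['gold'], rec['answers']):.2f}",
          f"stab={stability(rec['answers']):.2f}",
          f"delta={improvement(rec['gold'], rec['answers']):+d}")
```

On this toy data, prompt_01 is mostly judged correctly but unstably, while prompt_02 is perfectly stable yet always wrong, which is exactly why accuracy and stability are reported as separate measures.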


URL

https://arxiv.org/abs/2404.14883

PDF

https://arxiv.org/pdf/2404.14883.pdf

