Abstract
Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans; however, it remains to be determined whether such differences can be overcome by increasing model size. This work investigates the critical role of model scaling, asking whether increases in size compensate for these differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N=1,200 judgments are collected and scored for accuracy, stability, and improvement in accuracy upon repeated presentation of a prompt. The results of the best-performing LLM, ChatGPT-4, are compared with those of n=80 humans on the same stimuli. We find that increased model size may lead to better performance, but LLMs are still not as sensitive to (un)grammaticality as humans are. It seems possible, though unlikely, that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.
URL
https://arxiv.org/abs/2404.14883