Paper Reading AI Learner

From Form to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency

2024-04-18 12:48:17
Xenia Ohmer, Elia Bruni, Dieuwke Hupkes

Abstract

The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what "understanding" means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes - inspired by Fregean senses - of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model's multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.
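The core evaluation idea, checking whether a model gives the same answer when the same question is posed in different "senses" (translations or paraphrases), can be sketched as a simple pairwise-agreement metric. This is only an illustrative sketch, not the paper's actual GPT-3.5 pipeline: `multisense_consistency` and the `toy_model` stub below are hypothetical names introduced here for illustration.

```python
# Minimal sketch of a multisense-consistency check. The real study
# queries GPT-3.5 across five languages and several NLU benchmarks;
# here a toy stub stands in for the model.

def multisense_consistency(model, sense_variants):
    """Fraction of variant pairs for which the model's answers agree.

    `sense_variants` holds the same question expressed in different
    "senses", e.g. translations or paraphrases of one underlying fact.
    """
    answers = [model(q) for q in sense_variants]
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    if not pairs:          # a single variant is trivially consistent
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def toy_model(question):
    """Hypothetical stub: answers a capital-city fact, but its 'understanding'
    is sense-dependent, mirroring the inconsistency the paper reports."""
    known = {
        "What is the capital of France?": "Paris",
        "Was ist die Hauptstadt von Frankreich?": "Paris",
        "Quelle est la capitale de la France ?": "Lyon",  # inconsistent sense
    }
    return known.get(question, "unknown")


variants = [
    "What is the capital of France?",
    "Was ist die Hauptstadt von Frankreich?",
    "Quelle est la capitale de la France ?",
]
score = multisense_consistency(toy_model, variants)  # 1 of 3 pairs agree
```

A fully consistent model would score 1.0 on every variant set; the paper's finding is that GPT-3.5's scores drop well below that once the same content is presented in a different language or phrasing.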


URL

https://arxiv.org/abs/2404.12145

PDF

https://arxiv.org/pdf/2404.12145.pdf

