Abstract
Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task over a triplet of sentences associated with an image, to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation involving 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, text encoders of VLMs demonstrate greater sensitivity to semantic and lexical variations than unimodal text encoders. Our contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation without fine-tuning, and an assessment of LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision and unimodal language models, contributing to a deeper understanding of their capabilities. Data and code will be made available at this https URL.
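The 3-way (in)equivalence task described above can be illustrated with a minimal sketch: given encoder embeddings for an anchor caption, a semantically equivalent paraphrase, and a lexically similar but semantically different sentence, a model is scored on whether it ranks the paraphrase closer to the anchor. The function names, the scoring rule, and the toy vectors below are illustrative assumptions, not the paper's actual evaluation code; real sentence embeddings would come from the VLM or ULM text encoder under test.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_correct(anchor, positive, negative):
    """True if the encoder ranks the semantically equivalent sentence
    (positive) above the lexically similar but semantically different
    sentence (negative), relative to the anchor.  This ranking rule is
    an assumed, simplified stand-in for the benchmark's scoring."""
    return cosine(anchor, positive) > cosine(anchor, negative)

# Toy vectors standing in for sentence-encoder outputs (not real data).
anchor   = np.array([1.0, 0.2, 0.0])  # e.g. "a dog left of a cat"
positive = np.array([0.9, 0.3, 0.1])  # paraphrase with the same meaning
negative = np.array([0.1, 1.0, 0.5])  # lexically close, meaning differs

print(triplet_correct(anchor, positive, negative))
```

Averaging `triplet_correct` over all image-associated triplets would give an off-the-shelf accuracy for an encoder without any fine-tuning, which is the style of evaluation the abstract describes.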
URL
https://arxiv.org/abs/2404.16365