Abstract
Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preferences, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.
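Failure (1) amounts to a listener being unable to recover the intended referent from the expression alone. Below is a minimal Python sketch of such a listener-based check, under stated assumptions: the function refers_uniquely, the Region structure, and the toy word-overlap scorer are all illustrative, not the paper's actual protocol (the paper relies on human listeners and standard automatic metrics). Any real scorer, e.g. a vision-language matching model, could be plugged in for toy_score.

from typing import Callable, Dict, Sequence

# A region is represented here as a dict with an "id" and a text "label";
# this structure is an illustrative assumption, not the dataset format.
Region = Dict[str, object]

def refers_uniquely(
    expression: str,
    regions: Sequence[Region],
    target_id: int,
    score: Callable[[str, Region], float],
) -> bool:
    """True iff the target region is the listener's strict top choice."""
    scores = {r["id"]: score(expression, r) for r in regions}
    best_score = max(scores.values())
    winners = [rid for rid, s in scores.items() if s == best_score]
    # A tie with any distractor counts as referential failure: the
    # expression did not uniquely identify the referent.
    return winners == [target_id]

if __name__ == "__main__":
    # Toy listener: counts words shared between expression and label.
    def toy_score(expr: str, region: Region) -> float:
        return len(set(expr.lower().split()) & set(str(region["label"]).split()))

    regions = [
        {"id": 0, "label": "red mug on table"},
        {"id": 1, "label": "blue mug on table"},
    ]
    print(refers_uniquely("the red mug", regions, 0, toy_score))  # True
    print(refers_uniquely("the mug", regions, 0, toy_score))      # False: ambiguous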
URL
https://arxiv.org/abs/2504.16060