
Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

2025-04-22 17:37:16
Ziqiao Ma, Jing Ding, Xuejun Zhang, Dezhi Luo, Jiahe Ding, Sihan Xu, Yuchen Huang, Run Peng, Joyce Chai

Abstract

Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.
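To make the pragmatic criterion concrete, below is a minimal sketch (the scene, attributes, and both functions are hypothetical illustrations, not the paper's evaluation code or the RefOI annotation scheme) contrasting a surface-overlap score, which rewards n-gram similarity to the human reference, with a toy listener-style check that asks whether an expression picks out the target object uniquely:

```python
# Minimal sketch (all names hypothetical, not from the paper):
# surface-overlap metric vs. a toy check of referential success.
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Surface-overlap score in the spirit of BLEU/ROUGE-style metrics."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def referential_success(expression: str, objects: list[dict], target_id: int) -> bool:
    """Toy listener: succeeds iff the expression's content words
    single out the target object uniquely (Gricean adequacy)."""
    vocab = set().union(*(o["attributes"] for o in objects))
    content = set(expression.lower().split()) & vocab
    matches = [o["id"] for o in objects if content <= o["attributes"]]
    return matches == [target_id]

# Scene with two dogs; only color disambiguates them.
scene = [
    {"id": 0, "attributes": {"dog", "brown"}},
    {"id": 1, "attributes": {"dog", "black"}},
]
human_reference = "the brown dog"
model_output = "a large fluffy dog sitting on the grass"  # verbose but ambiguous

print(round(unigram_f1(model_output, human_reference), 3))    # ~0.364: partial overlap credit
print(referential_success(model_output, scene, target_id=0))  # False: matches both dogs
print(referential_success("brown dog", scene, target_id=0))   # True: minimal and unique
```

In this toy setup the verbose model output earns partial n-gram credit yet fails to refer, while the minimal human-style expression succeeds, illustrating how an overlap metric can reward superficial similarity rather than referential success.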


URL

https://arxiv.org/abs/2504.16060

PDF

https://arxiv.org/pdf/2504.16060.pdf

