Abstract
Large Vision-Language Models (VLMs) have demonstrated strong reasoning capabilities in tasks requiring a fine-grained understanding of literal images and text, such as visual question answering or visual entailment. However, there has been little exploration of these models' capabilities when presented with images and captions containing figurative phenomena such as metaphors or humor, the meaning of which is often implicit. To close this gap, we propose a new task and a high-quality dataset: Visual Figurative Language Understanding with Textual Explanations (V-FLUTE). We frame the visual figurative language understanding problem as an explainable visual entailment task, where the model has to predict whether the image (premise) entails a claim (hypothesis) and justify the predicted label with a textual explanation. Using a human-AI collaboration framework, we build a high-quality dataset, V-FLUTE, that contains 6,027 <image, claim, label, explanation> instances spanning five diverse multimodal figurative phenomena: metaphors, similes, idioms, sarcasm, and humor. The figurative phenomena can be present in the image, the caption, or both. We further conduct both automatic and human evaluations to assess current VLMs' capabilities in understanding figurative phenomena.
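The abstract describes each V-FLUTE example as an <image, claim, label, explanation> tuple tagged with one of five figurative phenomena. A minimal sketch of such an instance record might look like the following; the field names, label set, and example values are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Assumed label and phenomenon vocabularies, inferred from the abstract.
LABELS = {"entailment", "contradiction"}
PHENOMENA = {"metaphor", "simile", "idiom", "sarcasm", "humor"}

@dataclass
class VFluteInstance:
    """Hypothetical container for one <image, claim, label, explanation> tuple."""
    image_path: str    # premise: path to the image
    claim: str         # hypothesis: textual claim about the image
    label: str         # entailment decision for (image, claim)
    explanation: str   # free-text justification of the label
    phenomenon: str    # which figurative phenomenon the pair involves

    def __post_init__(self):
        # Validate against the assumed vocabularies above.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")
        if self.phenomenon not in PHENOMENA:
            raise ValueError(f"unknown phenomenon: {self.phenomenon}")

# Illustrative (invented) example instance.
example = VFluteInstance(
    image_path="images/0001.jpg",
    claim="The deadline is breathing down my neck.",
    label="entailment",
    explanation="The image shows a person hunched over a desk beside a "
                "looming clock, conveying the pressure the claim describes.",
    phenomenon="metaphor",
)
```

A model evaluated on this task would consume `image_path` and `claim`, then produce both a `label` and an `explanation`, so the explanation can be scored against the reference one.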
URL
https://arxiv.org/abs/2405.01474