V-FLUTE: Visual Figurative Language Understanding with Textual Explanations

2024-05-02 17:07:25
Arkadiy Saakyan, Shreyas Kulkarni, Tuhin Chakrabarty, Smaranda Muresan

Abstract

Large Vision-Language models (VLMs) have demonstrated strong reasoning capabilities in tasks requiring a fine-grained understanding of literal images and text, such as visual question-answering or visual entailment. However, there has been little exploration of these models' capabilities when presented with images and captions containing figurative phenomena such as metaphors or humor, the meaning of which is often implicit. To close this gap, we propose a new task and a high-quality dataset: Visual Figurative Language Understanding with Textual Explanations (V-FLUTE). We frame the visual figurative language understanding problem as an explainable visual entailment task, where the model has to predict whether the image (premise) entails a claim (hypothesis) and justify the predicted label with a textual explanation. Using a human-AI collaboration framework, we build a high-quality dataset, V-FLUTE, that contains 6,027 <image, claim, label, explanation> instances spanning five diverse multimodal figurative phenomena: metaphors, similes, idioms, sarcasm, and humor. The figurative phenomena can be present either in the image, the caption, or both. We further conduct both automatic and human evaluations to assess current VLMs' capabilities in understanding figurative phenomena.
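
The <image, claim, label, explanation> instance format described above can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the authors' released schema: the field names, the label strings, and the `phenomenon` field are hypothetical, and the sample instance is invented for demonstration rather than taken from V-FLUTE.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    # Hypothetical label names: the abstract only states that the model
    # predicts whether the image (premise) entails the claim (hypothesis).
    ENTAILS = "entails"
    DOES_NOT_ENTAIL = "does_not_entail"


# The five figurative phenomena named in the abstract.
PHENOMENA = ("metaphor", "simile", "idiom", "sarcasm", "humor")


@dataclass(frozen=True)
class Instance:
    image_path: str   # premise; assumed here to be a path to an image file
    claim: str        # hypothesis
    label: Label      # predicted or gold entailment label
    explanation: str  # textual justification for the label
    phenomenon: str   # assumed extra field; must be one of PHENOMENA

    def __post_init__(self):
        # Reject phenomena outside the five listed in the abstract.
        if self.phenomenon not in PHENOMENA:
            raise ValueError(f"unknown phenomenon: {self.phenomenon!r}")


# Invented example, for illustration only (not a real dataset row).
example = Instance(
    image_path="images/example.png",
    claim="The meeting was a rollercoaster.",
    label=Label.ENTAILS,
    explanation="The image shows people reacting to sudden ups and downs.",
    phenomenon="metaphor",
)
```

Keeping the explanation alongside the label in a single record reflects the task framing, in which a prediction is expected to come with its textual justification.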

URL

https://arxiv.org/abs/2405.01474

PDF

https://arxiv.org/pdf/2405.01474.pdf