FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

2024-04-23 03:42:14
Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo

Abstract

Recent progress in large-scale pre-training has produced advanced vision-language models (VLMs) that are remarkably proficient at comprehending and generating multimodal content. Although VLMs show an impressive ability to perform complex reasoning, current models often struggle to capture compositional information effectively and precisely on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark focused on text and image mismatch detection and correction. The benchmark introduces a novel task for boosting and evaluating VLMs' compositionality in aspect-based fine-grained text and image matching. In this task, a model is given an image-text pair that may contain between 0 and 3 mismatches and must identify the mismatched aspect phrases within the caption, determine each aspect's class, and propose corrections. To evaluate performance on this task, we propose a new evaluation metric, ITM-IoU, which our experiments show correlates highly with human evaluation. We also provide a comprehensive experimental analysis of existing mainstream VLMs under both fully supervised and in-context learning settings. We find that models trained on FineMatch are more proficient at detecting fine-grained text and image mismatches. Moreover, models with strong multimodal in-context learning abilities (e.g., GPT-4V, Gemini Pro Vision) are not as skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for hallucination detection and correction in text-to-image generation.
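
The abstract names the task output (mismatched aspect phrase, aspect class, correction) and an IoU-style metric, but it does not define ITM-IoU. The sketch below is therefore not the paper's metric: it is a plain set-based intersection-over-union over (phrase, class) pairs, shown only to make the general idea concrete. The `Mismatch` record, the `set_iou` function, the class labels "attribute" and "number", and the example phrases are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    """One predicted or annotated mismatch (hypothetical record layout)."""
    phrase: str        # mismatched aspect phrase taken from the caption
    aspect_class: str  # hypothetical class label, e.g. "attribute"
    correction: str    # proposed replacement phrase


def set_iou(predicted, gold):
    """Plain intersection-over-union on (phrase, class) pairs.

    Assumption: exact, case-insensitive matching. Corrections are ignored
    here; a real metric would likely score them as well.
    """
    pred_keys = {(m.phrase.lower(), m.aspect_class.lower()) for m in predicted}
    gold_keys = {(m.phrase.lower(), m.aspect_class.lower()) for m in gold}
    if not pred_keys and not gold_keys:
        # Both sides agree that the pair contains zero mismatches.
        return 1.0
    return len(pred_keys & gold_keys) / len(pred_keys | gold_keys)


# Invented example: the annotation lists two mismatches, the model finds one.
gold = [
    Mismatch("red car", "attribute", "blue car"),
    Mismatch("two dogs", "number", "three dogs"),
]
pred = [Mismatch("red car", "attribute", "blue car")]

score = set_iou(pred, gold)  # one shared pair out of two distinct pairs
```

Treating an empty prediction against an empty annotation as a perfect score is a design choice made here because the task allows pairs with 0 mismatches; whether the paper handles that case the same way is not stated in the abstract.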

URL

https://arxiv.org/abs/2404.14715

PDF

https://arxiv.org/pdf/2404.14715.pdf

