FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Abstract

Recent progress in large-scale pre-training has produced advanced vision-language models (VLMs) that are remarkably proficient at comprehending and generating multimodal content. Although VLMs show an impressive ability to perform complex reasoning, current models often struggle to capture compositional information precisely on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark that focuses on text and image mismatch detection and correction. The benchmark introduces a novel task for improving and evaluating the compositionality of VLMs in aspect-based fine-grained text and image matching. In this task, a model must identify the mismatched aspect phrases within a caption, determine each aspect's class, and propose corrections, for an image-text pair that may contain between 0 and 3 mismatches. To evaluate model performance on this task, we propose a new evaluation metric, ITM-IoU, which our experiments show correlates highly with human evaluation. We also provide a comprehensive experimental analysis of existing mainstream VLMs under both fully supervised learning and in-context learning settings. We find that models trained on FineMatch are more proficient at detecting fine-grained text and image mismatches. Moreover, models with strong multimodal in-context learning abilities (e.g., GPT-4V, Gemini Pro Vision) are not as skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for hallucination detection and correction in text-to-image generation.
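
To make the task format concrete, here is a minimal Python sketch of how a prediction for one image-text pair might be represented and compared against a reference. Everything in it is an assumption made for illustration: the `Mismatch` record, the aspect labels used in the example, and the `set_iou` helper are hypothetical names, and the score is a plain set intersection-over-union over (phrase, aspect class) pairs. The abstract does not define ITM-IoU, so this should not be read as the paper's metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    """One mismatched aspect in a caption (hypothetical record layout)."""
    phrase: str      # mismatched aspect phrase as it appears in the caption
    aspect: str      # aspect class label (illustrative labels only)
    correction: str  # proposed replacement phrase


def _key(m):
    # Compare on phrase and aspect class, ignoring case.
    # Corrections are deliberately not scored in this simplified sketch.
    return (m.phrase.lower(), m.aspect.lower())


def set_iou(predicted, reference):
    """Plain set IoU over (phrase, aspect) pairs; NOT the paper's ITM-IoU."""
    pred = {_key(m) for m in predicted}
    ref = {_key(m) for m in reference}
    if not pred and not ref:
        # A pair with zero mismatches, correctly predicted as such.
        return 1.0
    return len(pred & ref) / len(pred | ref)


# Made-up example: the reference has two mismatches, the model finds one
# of them and adds one spurious mismatch.
reference = [
    Mismatch("red car", "attribute", "blue car"),
    Mismatch("two dogs", "number", "three dogs"),
]
predicted = [
    Mismatch("red car", "attribute", "blue car"),
    Mismatch("on the table", "relation", "under the table"),
]
score = set_iou(predicted, reference)  # 1 shared pair out of 3 distinct pairs
```

The empty-versus-empty case returns 1.0 because the task allows pairs with zero mismatches, and a model that correctly reports none should not be penalized. A fuller scorer would also need to judge the quality of the proposed corrections, which this sketch leaves out.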

URL

https://arxiv.org/abs/2404.14715

PDF

https://arxiv.org/pdf/2404.14715.pdf
