Abstract
Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that the text encoders of CLIP notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across eight image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices from the literature; for example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that use the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. To this end, we introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and higher-order reasoning such as comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
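As a rough illustration (not the authors' implementation), VQAScore reduces to asking a generative VQA model a templated question and normalizing the model's score for the answer "Yes" against the score for "No". The sketch below assumes a hypothetical `vqa_logits_fn` stand-in for a real VQA model; the question template follows the abstract, and the softmax normalization is a minimal assumption about how the answer probability is computed.

```python
import math

def vqascore(image, text, vqa_logits_fn):
    """Sketch of VQAScore: P("Yes" | image, templated question).

    vqa_logits_fn is a hypothetical placeholder for a generative VQA
    model; it returns unnormalized log-scores for candidate answers.
    """
    question = f"Does this figure show '{text}'?"
    logits = vqa_logits_fn(image, question)  # e.g. {"Yes": 2.0, "No": -1.0}
    # Softmax over the candidate answers; VQAScore is the "Yes" probability.
    denom = sum(math.exp(v) for v in logits.values())
    return math.exp(logits["Yes"]) / denom

# Toy stand-in model, purely for demonstration: it favors "Yes" when the
# question mentions a horse.
def toy_model(image, question):
    if "horse" in question:
        return {"Yes": 2.0, "No": -1.0}
    return {"Yes": -1.0, "No": 2.0}

score = vqascore(None, "the horse is eating the grass", toy_model)
print(round(score, 3))  # → 0.953
```

In a real pipeline, the stand-in model would be replaced by an off-the-shelf VQA model (or the paper's CLIP-FlanT5), with the "Yes"/"No" scores read from the model's next-token logits.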
URL
https://arxiv.org/abs/2404.01291