Paper Reading AI Learner

Evaluating Text-to-Visual Generation with Image-to-Text Generation

2024-04-01 17:58:06
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan

Abstract

Despite significant progress in generative AI, comprehensive evaluation remains challenging due to the lack of effective metrics and standardized benchmarks. For instance, the widely used CLIPScore measures the alignment between a (generated) image and a text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that CLIP's text encoder can notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across eight image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices from the literature; for example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that use the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore lets researchers benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and higher-order reasoning such as comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
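The scoring mechanism the abstract describes can be sketched in a few lines: form the templated yes/no question, then take the generative VQA model's next-token distribution and read off the probability of the "Yes" token. This is a minimal illustrative sketch, not the authors' implementation; `vqascore_from_logits` and its list-of-floats interface are assumptions for clarity, since in practice the logits come from a model such as CLIP-FlanT5 conditioned on the image and question.

```python
import math

def format_question(text: str) -> str:
    """Build the yes/no question used by VQAScore (template from the paper)."""
    return f"Does this figure show '{text}'?"

def vqascore_from_logits(next_token_logits: list[float], yes_index: int) -> float:
    """Return the softmax probability of the 'Yes' answer token, given the
    VQA model's next-token logits; this probability is the VQAScore.
    (Hypothetical simplified interface: real models return logits over a
    large vocabulary, conditioned on the image and the question.)"""
    m = max(next_token_logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in next_token_logits]
    return exps[yes_index] / sum(exps)
```

For example, with toy logits `[2.0, 0.0]` where index 0 is the "Yes" token, the score is `1 / (1 + e^-2) ≈ 0.88`; a well-aligned image-text pair should push this probability toward 1.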

URL

https://arxiv.org/abs/2404.01291

PDF

https://arxiv.org/pdf/2404.01291.pdf

