Paper Reading AI Learner

Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

2024-04-25 17:58:43
Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kajić, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Chris Knutsen, Cyrus Rashtchian, Jordi Pont-Tuset, Aida Nematzadeh

Abstract

While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings -- and thereby the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.
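To make the idea of a QA-based alignment metric concrete, here is a minimal, hypothetical sketch in the spirit of TIFA/Gecko-style evaluation: questions are derived from the prompt and scored against a VQA model's answers. The question generator and VQA model below are toy stand-ins (assumptions, not the paper's actual components); a real pipeline would use an LLM for question generation and a trained VQA model answering against the generated image.

```python
# Hedged sketch of QA-based text-to-image alignment scoring.
# generate_questions and mock_vqa are hypothetical stand-ins, not the
# Gecko metric itself: real systems use an LLM to write questions from
# the prompt and a VQA model to answer them against the generated image.

def generate_questions(prompt):
    """Stand-in for LLM-based question generation from the prompt.

    Toy rule (an assumption for illustration): one yes/no existence
    question per content word in the prompt.
    """
    stopwords = {"a", "an", "the", "on", "in", "of", "and", "with"}
    words = [w.strip(".,").lower() for w in prompt.split()]
    return [(f"Is there {w} in the image?", "yes")
            for w in words if w and w not in stopwords]

def qa_alignment_score(prompt, vqa_answer):
    """Fraction of prompt-derived questions answered as expected."""
    qa_pairs = generate_questions(prompt)
    if not qa_pairs:
        return 0.0
    correct = sum(1 for q, expected in qa_pairs if vqa_answer(q) == expected)
    return correct / len(qa_pairs)

# Usage with a mock VQA model that "sees" a red cube but no dog:
def mock_vqa(question):
    return "yes" if "cube" in question or "red" in question else "no"

score = qa_alignment_score("a red cube on a dog", mock_vqa)  # -> 2/3
```

A per-question score like this is what lets such metrics be aggregated over skill categories, mirroring the skills-based benchmark described above.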

Abstract (translated)

尽管文本到图像(T2I)生成模型已经无处不在,但它们并不一定能生成与给定提示相符的图像。此前的工作通过提出指标、基准以及收集人工评判的模板来评估T2I对齐,但这些组件的质量并未得到系统性的衡量。人工评分的提示集通常较小,而评分的可靠性——以及由此用于比较模型的提示集的可靠性——并未得到评估。为填补这一空白,我们开展了一项广泛的研究,评估自动评估指标和人类评分模板。我们提供三项主要贡献:(1)我们引入了一个全面的基于技能的基准,能够在不同的人类模板下区分各个模型。该基准将提示按子技能分类,使实践者不仅能够确定哪些技能具有挑战性,还能确定技能在何种复杂度水平上开始变得具有挑战性。(2)我们在四种模板和四个T2I模型上收集人工评分,共计超过10万条标注。这使我们能够区分哪些差异源于提示本身的歧义,哪些差异源于指标和模型质量的不同。(3)最后,我们引入了一种新的基于问答(QA)的自动评估指标,在我们的新数据集上、在不同的人类模板间以及在TIFA160上,它与人工评分的相关性均优于现有指标。

URL

https://arxiv.org/abs/2404.16820

PDF

https://arxiv.org/pdf/2404.16820.pdf

