Paper Reading AI Learner

Transferring Visual Attributes from Natural Language to Verified Image Generation

2023-05-24 11:08:26
Rodrigo Valerio, Joao Bordalo, Michal Yarom, Yonattan Bitton, Idan Szpektor, Joao Magalhaes

Abstract

Text-to-image (T2I) generation methods are widely used to create art and other creative artifacts. While visual hallucinations can be a positive factor in scenarios where creativity is appreciated, such artifacts are poorly suited for cases where the generated image needs to be grounded in complex natural language without explicit visual elements. In this paper, we propose to strengthen the consistency property of T2I methods in the presence of natural, complex language, which often exceeds the limits of T2I methods by including non-visual information and textual elements that require knowledge for accurate generation. To address these phenomena, we propose a Natural Language to Verified Image generation approach (NL2VI) that converts a natural prompt into a visual prompt better suited for image generation. A T2I model then generates an image for the visual prompt, and the result is verified with VQA algorithms. Experimentally, aligning natural prompts with image generation can improve the consistency of the generated images by up to 11% over the state of the art. Moreover, the improvements generalize to challenging domains like cooking and DIY tasks, where the correctness of the generated image is crucial for illustrating actions.
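The abstract describes a three-stage pipeline: rewrite the natural prompt into a visual prompt, generate an image with a T2I model, then verify the result with VQA. The sketch below illustrates that control flow only; every function is a stand-in stub (the paper's actual prompt rewriter, T2I model, and VQA verifier are not specified here), and the names `to_visual_prompt`, `generate_image`, `vqa_verify`, and `nl2vi` are hypothetical.

```python
# Minimal sketch of the NL2VI pipeline from the abstract.
# All components are hypothetical stubs, not the paper's implementation.

def to_visual_prompt(natural_prompt: str) -> str:
    # Stub: the paper converts a natural prompt (which may contain
    # non-visual information) into a visually explicit prompt.
    return f"a photo showing: {natural_prompt}"

def generate_image(visual_prompt: str) -> dict:
    # Stub standing in for a T2I model; returns a fake "image" record.
    return {"prompt": visual_prompt, "pixels": None}

def vqa_verify(image: dict, expected_elements: list[str]) -> bool:
    # Stub standing in for VQA-based verification: check that each
    # expected visual element is confirmed for the generated image.
    return all(e.lower() in image["prompt"].lower() for e in expected_elements)

def nl2vi(natural_prompt: str, expected_elements: list[str], max_tries: int = 3):
    """Generate an image and keep only a VQA-verified candidate."""
    visual_prompt = to_visual_prompt(natural_prompt)
    for _ in range(max_tries):
        image = generate_image(visual_prompt)
        if vqa_verify(image, expected_elements):
            return image
    return None  # no candidate passed verification

result = nl2vi("whisk the eggs until fluffy", ["eggs", "whisk"])
assert result is not None
```

The retry loop reflects the verification step's role as a filter: candidates that fail the VQA check are discarded rather than returned to the user.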

URL

https://arxiv.org/abs/2305.15026

PDF

https://arxiv.org/pdf/2305.15026.pdf
