ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis

2024-04-15 21:19:10
Aashish Anantha Ramakrishnan, Sharon X. Huang, Dongwon Lee

Abstract

Text-to-Image (T2I) Synthesis has made tremendous strides in enhancing synthesized image quality, but current datasets evaluate model performance only on descriptive, instruction-based prompts. Real-world news image captions take a more pragmatic approach, providing high-level situational and Named-Entity (NE) information and limited physical object descriptions, making them abstractive. To evaluate the ability of T2I models to capture intended subjects from news captions, we introduce the Abstractive News Captions with High-level cOntext Representation (ANCHOR) dataset, containing 70K+ samples sourced from 5 different news media organizations. With Large Language Models (LLMs) achieving success in language and commonsense reasoning tasks, we explore the ability of different LLMs to identify and understand key subjects from abstractive captions. Our proposed method, Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights. It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR. By launching the ANCHOR dataset, we hope to motivate research into furthering the Natural Language Understanding (NLU) capabilities of T2I models.
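To make the abstract's two-stage idea concrete, below is a minimal, hypothetical Python sketch: an LLM is prompted to assign importance weights to the subjects of a news caption, and those weights are then used to amplify the corresponding token embeddings before diffusion sampling. The model names, the prompt template, and the 0-1 weight scale are all assumptions, and this is generic inference-time prompt re-weighting built with Hugging Face transformers and diffusers, not the paper's SAFE fine-tuning procedure.

```python
import json
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

caption = ("Protesters gather outside the venue as delegates arrive "
           "for the climate summit in Glasgow.")

# Stage 1: ask an LLM for subject importance weights (prompt is illustrative).
llm = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
prompt = ("List the key visual subjects in this news caption and give each an "
          "importance weight between 0 and 1, as a single JSON object.\n"
          f"Caption: {caption}\nJSON: ")
raw = llm(prompt, max_new_tokens=96, do_sample=False)[0]["generated_text"]
# Parsing free-form LLM output is fragile; a sketch-level shortcut.
subject_weights = {k.lower(): float(v)
                   for k, v in json.loads(raw[len(prompt):].strip()).items()}

# Stage 2: amplify the text embeddings of the weighted subject tokens.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
ids = pipe.tokenizer(caption, padding="max_length",
                     max_length=pipe.tokenizer.model_max_length,
                     truncation=True, return_tensors="pt").input_ids
with torch.no_grad():
    embeds = pipe.text_encoder(ids.to(pipe.device))[0]

# CLIP's BPE marks word ends with "</w>"; match tokens against subject words.
for i, tok in enumerate(pipe.tokenizer.convert_ids_to_tokens(ids[0])):
    word = tok.replace("</w>", "").lower()
    if word in subject_weights:
        embeds[0, i] *= 1.0 + subject_weights[word]

image = pipe(prompt_embeds=embeds).images[0]
image.save("anchor_sketch.png")
```

Note that SAFE, as the abstract describes it, injects the subject weights during fine-tuning and also performs Domain Fine-tuning on news images; the sketch above only mimics the subject-conditioning signal at inference time.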

URL

https://arxiv.org/abs/2404.10141

PDF

https://arxiv.org/pdf/2404.10141.pdf

