
Towards Equitable Representation in Text-to-Image Synthesis Models with the Cross-Cultural Understanding Benchmark Dataset

2023-01-28 03:10:33
Zhixuan Liu, Youeun Shin, Beverley-Claire Okogwu, Youngsik Yun, Lia Coleman, Peter Schaldenbrand, Jihie Kim, Jean Oh

Abstract

It has been shown that accurate representation in media improves the well-being of the people who consume it. By contrast, inaccurate representations can negatively affect viewers and lead to harmful perceptions of other cultures. To achieve inclusive representation in generated images, we propose a culturally-aware priming approach for text-to-image synthesis that uses a small but culturally curated dataset we collected, known here as the Cross-Cultural Understanding Benchmark (CCUB) dataset, to counter the bias prevalent in giant datasets. Our proposed approach comprises two fine-tuning techniques: (1) adding visual context by fine-tuning a pre-trained text-to-image synthesis model, Stable Diffusion, on the CCUB text-image pairs, and (2) adding semantic context via automated prompt engineering with a large language model, GPT-3, fine-tuned on our CCUB culturally-aware text data. The CCUB dataset is curated, and our approach is evaluated, by people who have a personal relationship with the culture in question. Our experiments indicate that priming with both text and images is effective at improving the cultural relevance and decreasing the offensiveness of generated images while maintaining image quality.
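
The two fine-tuning techniques compose naturally into a generation pipeline: the fine-tuned GPT-3 first rewrites a plain prompt with culturally specific detail (semantic priming), and the CCUB-fine-tuned Stable Diffusion checkpoint then renders the rewritten prompt (visual priming). Below is a minimal Python sketch of that flow, not the authors' released code: the model identifiers "ccub-finetuned-sd" and "davinci:ft-ccub" and the rewrite template are hypothetical placeholders, and the sketch assumes Hugging Face diffusers plus the legacy openai-python (pre-1.0) Completion API used for GPT-3 fine-tunes at the time.

import os

import openai
import torch
from diffusers import StableDiffusionPipeline

openai.api_key = os.environ["OPENAI_API_KEY"]

def add_semantic_context(prompt: str, culture: str) -> str:
    """Semantic priming: ask the CCUB-fine-tuned GPT-3 model to rewrite
    a plain prompt with culturally specific detail."""
    response = openai.Completion.create(
        model="davinci:ft-ccub",  # hypothetical fine-tuned model id
        prompt=f"Add {culture} cultural context to this image caption: {prompt}\n",
        max_tokens=64,
        temperature=0.7,
    )
    return response["choices"][0]["text"].strip()

# Visual priming: load a Stable Diffusion checkpoint fine-tuned on the
# CCUB text-image pairs (the checkpoint path is a placeholder).
pipe = StableDiffusionPipeline.from_pretrained(
    "ccub-finetuned-sd", torch_dtype=torch.float16
).to("cuda")

primed_prompt = add_semantic_context("a family sharing a meal", "Korean")
image = pipe(primed_prompt).images[0]
image.save("primed_output.png")

The point of the design, per the abstract, is that both priming signals come from the same small curated CCUB data, so the prompt rewriting and the image model steer generation in the same culturally grounded direction.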

URL

https://arxiv.org/abs/2301.12073

PDF

https://arxiv.org/pdf/2301.12073.pdf

